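# ch11 Financial and Economic Data Applications
The examples in this chapter use the term cross-section to mean the data at a single point in time. For example, the closing prices of all S&P 500 constituents on a given date form a cross-section. Cross-sections of several data items (such as price and volume) over multiple points in time make up a panel. Panel data can be represented either as a hierarchically indexed DataFrame or as a three-dimensional pandas Panel object (Panel was removed in pandas 0.25, so the hierarchically indexed DataFrame is the form to use in current versions).
## 11.1 Data Munging Topics
### 11.1.1 Time Series and Cross-Section Alignment
Data alignment is one of the most tedious problems when working with financial data. pandas aligns data automatically in arithmetic operations, which in practice gives a great deal of freedom and improves productivity. Consider the following two DataFrames, which contain time series of stock prices and trading volume respectively: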

In [1]:
import pandas as pd
prices = pd.DataFrame({
    'AAPL':[379.74,383.93,384.14,377.48,379.94,384.62,389.30],
    'JNJ':[64.64,65.43,64.95,63.64,63.59,63.61,63.73],
    'SPX':[1165.24,1198.62,1185.90,1154.23,1162.27,1172.87,1188.68],
    'XOM':[71.15,73.65,72.82,71.01,71.84,71.65,72.64]
},index=['2011-09-06','2011-09-07','2011-09-08','2011-09-09','2011-09-12','2011-09-13','2011-09-14'])

volume = pd.DataFrame({
    'AAPL':[18173500,12492000,14839800,20171900,16697300],
    'JNJ':[15848300,10759700,15551500,17008200,13448200],
    'XOM':[25416300,23108400,22434800,27969100,26205800]
},index=['2011-09-06','2011-09-07','2011-09-08','2011-09-09','2011-09-12'])

In [2]:
prices

,AAPL,JNJ,SPX,XOM
2011-09-06,379.74,64.64,1165.24,71.15
2011-09-07,383.93,65.43,1198.62,73.65
2011-09-08,384.14,64.95,1185.9,72.82
2011-09-09,377.48,63.64,1154.23,71.01
2011-09-12,379.94,63.59,1162.27,71.84
2011-09-13,384.62,63.61,1172.87,71.65
2011-09-14,389.3,63.73,1188.68,72.64


In [3]:
volume

,AAPL,JNJ,XOM
2011-09-06,18173500,15848300,25416300
2011-09-07,12492000,10759700,23108400
2011-09-08,14839800,15551500,22434800
2011-09-09,20171900,17008200,27969100
2011-09-12,16697300,13448200,26205800


In [4]:
prices * volume # aligned automatically

,AAPL,JNJ,SPX,XOM
2011-09-06,6901205000.0,1024434000.0,,1808370000.0
2011-09-07,4796054000.0,704007200.0,,1701934000.0
2011-09-08,5700561000.0,1010070000.0,,1633702000.0
2011-09-09,7614489000.0,1082402000.0,,1986086000.0
2011-09-12,6343972000.0,855171000.0,,1882625000.0
2011-09-13,,,,
2011-09-14,,,,


In [5]:
vwap = (prices * volume).sum()/volume.sum()
vwap.round(2)

AAPL    380.66
JNJ      64.39
SPX        NaN
XOM      72.02
dtype: float64

In [6]:
vwap.dropna().round(2) # explicitly drop SPX

AAPL    380.66
JNJ      64.39
XOM      72.02
dtype: float64

To align by hand, use the DataFrame align method, which returns a tuple containing reindexed versions of the two objects:

In [7]:
prices.align(volume,join='inner')

(              AAPL    JNJ    XOM
 2011-09-06  379.74  64.64  71.15
 2011-09-07  383.93  65.43  73.65
 2011-09-08  384.14  64.95  72.82
 2011-09-09  377.48  63.64  71.01
 2011-09-12  379.94  63.59  71.84,                 AAPL       JNJ       XOM
 2011-09-06  18173500  15848300  25416300
 2011-09-07  12492000  10759700  23108400
 2011-09-08  14839800  15551500  22434800
 2011-09-09  20171900  17008200  27969100
 2011-09-12  16697300  13448200  26205800)
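
As a minimal sketch of how the tuple returned by align can be unpacked, the example below compares `join='inner'` with the default outer join. The frames `left` and `right` and their values are made up for illustration and are not the price and volume data above.

```python
import pandas as pd

# Two small illustrative frames with partially overlapping rows and columns
left = pd.DataFrame({'A': [1.0, 2.0, 3.0], 'B': [4.0, 5.0, 6.0]},
                    index=['d1', 'd2', 'd3'])
right = pd.DataFrame({'A': [10.0, 20.0]}, index=['d1', 'd2'])

# inner join keeps only the row and column labels present in both objects
l_in, r_in = left.align(right, join='inner')

# outer join (the default) keeps the union of labels and fills gaps with NaN
l_out, r_out = left.align(right, join='outer')

print(l_in.shape, r_in.shape)    # both results share the same labels
print(l_out.shape, r_out.shape)
print(r_out)
```

After alignment the two results always have identical row and column labels, which is what makes element-wise arithmetic between them well defined.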

Another handy feature is building a DataFrame from a collection of Series whose indexes may differ:

In [8]:
s1 = pd.Series(range(3),index=['a','b','c'])
s2 = pd.Series(range(4),index=['d','b','c','e'])
s3 = pd.Series(range(3),index=['f','a','c'])
pd.DataFrame({'one':s1,'two':s2,'three':s3})

,one,three,two
a,0.0,1.0,
b,1.0,,1.0
c,2.0,2.0,2.0
d,,,0.0
e,,,3.0
f,,0.0,


In [9]:
# The index of the result can also be specified explicitly
pd.DataFrame({'one':s1,'two':s2,'three':s3},index = list('face'))

,one,three,two
f,,0.0,
a,0.0,1.0,
c,2.0,2.0,2.0
e,,,3.0


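### 11.1.2 Operations on Time Series of Different Frequencies
The two main tools for frequency conversion and realignment are the resample and reindex methods:
+ resample converts data to a fixed frequency;
+ reindex conforms data to a new index.

Both support interpolation logic such as forward filling.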

In [10]:
# Weekly time series
import numpy as np
ts1 = pd.Series(np.random.randn(3),
            index = pd.date_range('2012-6-13',periods = 3,freq = 'W-WED'))
ts1

2012-06-13    0.353840
2012-06-20    0.409834
2012-06-27   -0.795460
Freq: W-WED, dtype: float64

In [11]:
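# Resampling to business-day (Monday to Friday) frequency leaves "holes" on the days without data.
# asfreq() is used here because sum() returns 0 rather than NaN for empty bins in pandas 0.22 and later.
#help(pd.DataFrame.resample)
ts1.resample('B').asfreq()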

2012-06-13    0.353840
2012-06-14         NaN
2012-06-15         NaN
2012-06-18         NaN
2012-06-19         NaN
2012-06-20    0.409834
2012-06-21         NaN
2012-06-22         NaN
2012-06-25         NaN
2012-06-26         NaN
2012-06-27   -0.795460
Freq: B, dtype: float64

In [12]:
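# Fill the gaps with the preceding value.
# This is common with lower-frequency data, because every timestamp in the result then carries the latest valid value: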
ts1.resample('B').ffill()

2012-06-13    0.353840
2012-06-14    0.353840
2012-06-15    0.353840
2012-06-18    0.353840
2012-06-19    0.353840
2012-06-20    0.409834
2012-06-21    0.409834
2012-06-22    0.409834
2012-06-25    0.409834
2012-06-26    0.409834
2012-06-27   -0.795460
Freq: B, dtype: float64

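Upsampling lower-frequency data to a higher, regular frequency is a fine solution, but it may be less suitable for irregular time series in general. The following is an irregularly sampled time series: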

In [13]:
dates = pd.DatetimeIndex(['2012-6-12','2012-6-17','2012-6-18',
                         '2012-6-21','2012-6-22','2012-6-29'])
ts2 = pd.Series(np.random.randn(6),index=dates)
ts2

2012-06-12   -0.291311
2012-06-17    0.828706
2012-06-18   -1.860435
2012-06-21   -1.157795
2012-06-22    0.531472
2012-06-29    0.170498
dtype: float64

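To add the "as of" values of ts1 (that is, forward-filled) to ts2, one option is to resample both to a regular frequency and then add them. If you want to keep the date index of ts2, however, reindex is a better solution: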

In [14]:
ts1.reindex(ts2.index,method = 'ffill')

2012-06-12         NaN
2012-06-17    0.353840
2012-06-18    0.353840
2012-06-21    0.409834
2012-06-22    0.409834
2012-06-29   -0.795460
dtype: float64

In [15]:
ts2 + ts1.reindex(ts2.index,method = 'ffill')

2012-06-12         NaN
2012-06-17    1.182546
2012-06-18   -1.506595
2012-06-21   -0.747961
2012-06-22    0.941306
2012-06-29   -0.624963
dtype: float64
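
The forward-filling reindex above can be cross-checked against Series.asof. The sketch below uses fixed illustrative values (`weekly`, `irregular`) instead of random draws so the result is reproducible; it assumes the source series contains no NA values, since asof skips NA entries while `reindex(method='ffill')` picks the last label regardless of its value.

```python
import numpy as np
import pandas as pd

# Illustrative weekly series with fixed values
weekly = pd.Series([1.0, 2.0, 3.0],
                   index=pd.date_range('2012-06-13', periods=3, freq='W-WED'))

# Irregular dates, matching the index of ts2 above
irregular = pd.DatetimeIndex(['2012-06-12', '2012-06-17', '2012-06-18',
                              '2012-06-21', '2012-06-22', '2012-06-29'])

via_reindex = weekly.reindex(irregular, method='ffill')
via_asof = weekly.asof(irregular)

# Compare the two results, treating NaN in the same position as equal
same = np.allclose(via_reindex.values, via_asof.values, equal_nan=True)

print(via_reindex)
print(same)
```

The first date precedes every observation in the weekly series, so both approaches leave it as NaN.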

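### 11.1.3 Using Periods
Period (which represents a time span) offers another way to work with time series of different frequencies, especially financial or economic series with annual or quarterly frequency that follow a particular reporting convention.
Consider two macroeconomic time series for GDP and inflation: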

In [16]:
gdp = pd.Series([1.78,1.94,2.08,2.01,2.15,2.31,2.46],
               index = pd.period_range('1984Q2',periods = 7,freq='Q-SEP'))
gdp

1984Q2    1.78
1984Q3    1.94
1984Q4    2.08
1985Q1    2.01
1985Q2    2.15
1985Q3    2.31
1985Q4    2.46
Freq: Q-SEP, dtype: float64

In [17]:
infl = pd.Series([0.025,0.045,0.037,0.04],
                index = pd.period_range(1982,periods=4,freq='A-DEC'))
infl

1982    0.025
1983    0.045
1984    0.037
1985    0.040
Freq: A-DEC, dtype: float64

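Unlike time series indexed by Timestamp, operations between two Period-indexed time series of different frequencies require an explicit conversion. In this example, assuming the infl values were observed at the end of each year, we can convert them to Q-SEP to obtain the correct period in that frequency: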

In [18]:
infl_q = infl.asfreq('Q-SEP',how = 'end')
infl_q

1983Q1    0.025
1984Q1    0.045
1985Q1    0.037
1986Q1    0.040
Freq: Q-SEP, dtype: float64
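
Why the year 1982 maps to 1983Q1 can be seen from how Q-SEP quarters are laid out. The following is a small sketch, not part of the original notebook; the variable names are illustrative.

```python
import pandas as pd

# With freq='Q-SEP' the fiscal year ends in September, so the quarter that
# contains 31 December 1982 is the first quarter of fiscal year 1983.
year_end = pd.Timestamp('1982-12-31')
quarter = year_end.to_period('Q-SEP')
print(quarter)

# Calendar span covered by one of the periods in the gdp index
p = pd.Period('1984Q2', freq='Q-SEP')
print(p.start_time.date(), p.end_time.date())
```

Under this convention 1984Q2 covers January through March 1984, which is why the quarter labels look shifted relative to calendar quarters.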

That time series can then be reindexed, using forward fill to match gdp:

In [19]:
infl_q.reindex(gdp.index,method='ffill')

1984Q2    0.045
1984Q3    0.045
1984Q4    0.045
1985Q1    0.037
1985Q2    0.037
1985Q3    0.037
1985Q4    0.037
Freq: Q-SEP, dtype: float64

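### 11.1.4 Time of Day and "As Of" Data Selection
Suppose you have a long time series of intraday market data and want to extract the price at a particular time on each day. What if the data are irregular, so that observations do not fall exactly on the desired time?

In [20]:
# Build a date range and time series covering one trading day
rng = pd.date_range('2012-06-01 09:30','2012-06-01 15:59',freq='T')
# Append the same times for the next three business days (four trading days in total)
rng = rng.append([rng+pd.offsets.BDay(i) for i in range(1,4)])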
ts = pd.Series(np.arange(len(rng),dtype = float),index=rng)
ts[-10:]

2012-06-06 15:50:00    1550.0
2012-06-06 15:51:00    1551.0
2012-06-06 15:52:00    1552.0
2012-06-06 15:53:00    1553.0
2012-06-06 15:54:00    1554.0
2012-06-06 15:55:00    1555.0
2012-06-06 15:56:00    1556.0
2012-06-06 15:57:00    1557.0
2012-06-06 15:58:00    1558.0
2012-06-06 15:59:00    1559.0
dtype: float64

Indexing with a Python datetime.time object extracts the values at those times:

In [21]:
from datetime import time
ts[time(10,0)]
# Under the hood this uses the at_time instance method (available on time series and on DataFrame objects alike)
ts.at_time(time(10,0))

2012-06-01 10:00:00      30.0
2012-06-04 10:00:00     420.0
2012-06-05 10:00:00     810.0
2012-06-06 10:00:00    1200.0
dtype: float64

There is also a between_time method, which selects the values between two datetime.time objects:

In [22]:
ts.between_time(time(10,0),time(10,1))

2012-06-01 10:00:00      30.0
2012-06-01 10:01:00      31.0
2012-06-04 10:00:00     420.0
2012-06-04 10:01:00     421.0
2012-06-05 10:00:00     810.0
2012-06-05 10:01:00     811.0
2012-06-06 10:00:00    1200.0
2012-06-06 10:01:00    1201.0
dtype: float64

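As mentioned above, there may be no data at all at a specific time such as 10 AM. In that case you may want the last value observed before 10 AM:

In [23]:
# Randomly set most of the time series to NA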
indexer = np.sort(np.random.permutation(len(ts))[700:])
irr_ts = ts.copy()
irr_ts[indexer] = np.nan
irr_ts['2012-06-04 09:50':'2012-06-04 10:10']

2012-06-04 09:50:00    410.0
2012-06-04 09:51:00      NaN
2012-06-04 09:52:00      NaN
2012-06-04 09:53:00      NaN
2012-06-04 09:54:00    414.0
2012-06-04 09:55:00      NaN
2012-06-04 09:56:00    416.0
2012-06-04 09:57:00    417.0
2012-06-04 09:58:00      NaN
2012-06-04 09:59:00      NaN
2012-06-04 10:00:00      NaN
2012-06-04 10:01:00    421.0
2012-06-04 10:02:00      NaN
2012-06-04 10:03:00      NaN
2012-06-04 10:04:00      NaN
2012-06-04 10:05:00    425.0
2012-06-04 10:06:00      NaN
2012-06-04 10:07:00      NaN
2012-06-04 10:08:00      NaN
2012-06-04 10:09:00    429.0
2012-06-04 10:10:00    430.0
dtype: float64

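Passing an array of Timestamps to the asof method returns the last valid (non-NA) value at or before each of those timestamps. For example, we build a date range at 10 AM on each day and pass it to asof: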

In [24]:
selection = pd.date_range('2012-06-01 10:00',periods=4,freq='B')
selection

DatetimeIndex(['2012-06-01 10:00:00', '2012-06-04 10:00:00',
               '2012-06-05 10:00:00', '2012-06-06 10:00:00'],
              dtype='datetime64[ns]', freq='B')

In [25]:
irr_ts.asof(selection)

2012-06-01 10:00:00      28.0
2012-06-04 10:00:00     417.0
2012-06-05 10:00:00     807.0
2012-06-06 10:00:00    1200.0
Freq: B, dtype: float64
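
Because the series above is built from random draws, the exact values differ from run to run. The following deterministic sketch, with made-up values and illustrative names, shows the same asof lookup for a single timestamp.

```python
import numpy as np
import pandas as pd

# Ten one-minute observations around 10:00 (illustrative values 0..9)
idx = pd.date_range('2012-06-01 09:55', periods=10, freq='min')
s = pd.Series(np.arange(10, dtype=float), index=idx)

# Knock out 09:58, 09:59 and 10:00 so nothing is observed exactly at 10:00
s.iloc[3:6] = np.nan

target = pd.Timestamp('2012-06-01 10:00')
last_valid = s.asof(target)      # last non-NA value at or before 10:00
print(last_valid)

# The same answer by forward-filling first and then reading the label
print(s.ffill().loc[target])
```

Here the value returned is the 09:57 observation, the most recent one that is not NA.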

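### 11.1.5 Splicing Together Data Sources

Situations that come up often in finance and economics:
+ switching from one data source to another at a specific point in time;
+ "patching" missing values in one time series with values from another;
+ replacing symbols in the data (countries, asset tickers and so on) with actual data.

In the first case, switching from one time series to another at a specific instant amounts to using pandas.concat to join two Series or DataFrame objects: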

In [26]:
data1 = pd.DataFrame(np.ones((6,3),dtype=float),
                    columns = ['a','b','c'],
                    index = pd.date_range('6/12/2012',periods = 6))
data2 = pd.DataFrame(np.ones((6,3),dtype=float) * 2,
                    columns = ['a','b','c'],
                    index = pd.date_range('6/13/2012',periods = 6))
data2

,a,b,c
2012-06-13,2.0,2.0,2.0
2012-06-14,2.0,2.0,2.0
2012-06-15,2.0,2.0,2.0
2012-06-16,2.0,2.0,2.0
2012-06-17,2.0,2.0,2.0
2012-06-18,2.0,2.0,2.0


In [27]:
#.ix is deprecated. Please use
#.loc for label based indexing or
#.iloc for positional indexing
spliced = pd.concat([data1.loc[:'2012-06-14'],data2.loc['2012-06-15':]])
spliced

,a,b,c
2012-06-12,1.0,1.0,1.0
2012-06-13,1.0,1.0,1.0
2012-06-14,1.0,1.0,1.0
2012-06-15,2.0,2.0,2.0
2012-06-16,2.0,2.0,2.0
2012-06-17,2.0,2.0,2.0
2012-06-18,2.0,2.0,2.0
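
The splice can be wrapped in a small function. The helper `splice` below is hypothetical, written only for illustration, and assumes both inputs have a sorted DatetimeIndex; it uses boolean masks rather than label slices so that the switch date belongs to exactly one of the two pieces.

```python
import numpy as np
import pandas as pd

def splice(old, new, switch_date):
    """Use `old` strictly before switch_date and `new` from switch_date on.

    Hypothetical helper for illustration; assumes both inputs have a
    sorted DatetimeIndex.
    """
    switch = pd.Timestamp(switch_date)
    return pd.concat([old[old.index < switch], new[new.index >= switch]])

old = pd.DataFrame(np.ones((6, 3)), columns=list('abc'),
                   index=pd.date_range('2012-06-12', periods=6))
new = pd.DataFrame(np.ones((6, 3)) * 2, columns=list('abc'),
                   index=pd.date_range('2012-06-13', periods=6))

result = splice(old, new, '2012-06-15')
print(result)
```

Checking `result.index.is_unique` afterwards is a cheap way to confirm that no date was taken from both sources.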


Consider another simple example, in which data1 lacks a time series that is present in data2:

In [28]:
data2 = pd.DataFrame(np.ones((6,4),dtype=float) * 2,
                    columns = ['a','b','c','d'],
                    index = pd.date_range('6/13/2012',periods = 6))
data2

,a,b,c,d
2012-06-13,2.0,2.0,2.0,2.0
2012-06-14,2.0,2.0,2.0,2.0
2012-06-15,2.0,2.0,2.0,2.0
2012-06-16,2.0,2.0,2.0,2.0
2012-06-17,2.0,2.0,2.0,2.0
2012-06-18,2.0,2.0,2.0,2.0


In [29]:
spliced = pd.concat([data1.loc[:'2012-06-14'],data2.loc['2012-06-15':]])
spliced

,a,b,c,d
2012-06-12,1.0,1.0,1.0,
2012-06-13,1.0,1.0,1.0,
2012-06-14,1.0,1.0,1.0,
2012-06-15,2.0,2.0,2.0,2.0
2012-06-16,2.0,2.0,2.0,2.0
2012-06-17,2.0,2.0,2.0,2.0
2012-06-18,2.0,2.0,2.0,2.0


combine_first brings in data from before the splice point, which extends the history of the 'd' item:

In [30]:
spliced_filled = spliced.combine_first(data2)
spliced_filled

,a,b,c,d
2012-06-12,1.0,1.0,1.0,
2012-06-13,1.0,1.0,1.0,2.0
2012-06-14,1.0,1.0,1.0,2.0
2012-06-15,2.0,2.0,2.0,2.0
2012-06-16,2.0,2.0,2.0,2.0
2012-06-17,2.0,2.0,2.0,2.0
2012-06-18,2.0,2.0,2.0,2.0


DataFrame also has a related method, update, which performs the update in place. To fill only the holes, you must pass overwrite=False:

In [31]:
spliced.update(data2,overwrite = False)

In [32]:
spliced

,a,b,c,d
2012-06-12,1.0,1.0,1.0,
2012-06-13,1.0,1.0,1.0,2.0
2012-06-14,1.0,1.0,1.0,2.0
2012-06-15,2.0,2.0,2.0,2.0
2012-06-16,2.0,2.0,2.0,2.0
2012-06-17,2.0,2.0,2.0,2.0
2012-06-18,2.0,2.0,2.0,2.0


All of the techniques above can replace symbols in the data with actual data, but it is sometimes simpler to set columns directly using DataFrame indexing:

In [33]:
cp_spliced = spliced.copy()
cp_spliced[['a','c']] = data1[['a','c']]
cp_spliced

,a,b,c,d
2012-06-12,1.0,1.0,1.0,
2012-06-13,1.0,1.0,1.0,2.0
2012-06-14,1.0,1.0,1.0,2.0
2012-06-15,1.0,2.0,1.0,2.0
2012-06-16,1.0,2.0,1.0,2.0
2012-06-17,1.0,2.0,1.0,2.0
2012-06-18,,2.0,,2.0


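### 11.1.6 Return Indexes and Cumulative Returns
In finance, a return usually refers to the percentage change in the price of an asset.

The following is Apple's stock price data from 2011 to 2012: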

In [34]:
#import pandas.io.data as web
from pandas_datareader import data
price = data.get_data_yahoo('AAPL','2011-01-01')['Adj Close']

In [35]:
price = price[:'2012-07-27']
price[-5:]

Date
2012-07-23    77.327621
2012-07-24    76.954979
2012-07-25    73.631767
2012-07-26    73.620232
2012-07-27    74.936714
Name: Adj Close, dtype: float64

For Apple's stock (no dividends), the cumulative percentage return between two points in time is simply the percentage change in price:

In [36]:
price['2011-10-03']/price['2011-3-01']-1 # pct_change

0.072399849856854992
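
The end-point ratio is equivalent to compounding the daily returns in between. A small sketch with made-up prices (not the AAPL data above) shows the two calculations side by side.

```python
import numpy as np
import pandas as pd

# Illustrative prices for five business days (made-up numbers)
px = pd.Series([100.0, 102.0, 101.0, 105.0, 110.0],
               index=pd.date_range('2011-03-01', periods=5, freq='B'))

start, end = px.index[0], px.index[-1]

# Direct calculation from the two end-point prices
direct = px[end] / px[start] - 1

# The same number obtained by compounding the daily returns in between
daily = px.pct_change().dropna()
compounded = (1 + daily).prod() - 1

print(direct, compounded)
```

The intermediate prices cancel out in the product, which is why only the first and last price matter for the cumulative return.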

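For other stocks that do pay dividends, working out how much money was made on a holding is more complicated. The adjusted close used here, however, has already been adjusted for splits and dividends. In either case it is common to first derive a return index, a time series that tracks the value of a unit investment (one dollar, say). Many assumptions can underlie a return index.

For example, one can choose whether or not to reinvest profits. In Apple's case, a simple return index can be computed with cumprod:

In [37]:
returns = price.pct_change() # compute percentage changes
#help(pd.Series.pct_change)
ret_index = (1 + returns).cumprod()
ret_index.iloc[0] = 1 # the first return is NaN, so set the starting value by position
ret_index[:5]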

Date
2010-12-31    1.000000
2011-01-03    1.021732
2011-01-04    1.027065
2011-01-05    1.035466
2011-01-06    1.034629
Name: Adj Close, dtype: float64

With the return index in hand, computing cumulative returns over a given period is straightforward:

In [38]:
m_returns = ret_index.resample('BM').last().pct_change()
m_returns['2012']

Date
2012-01-31    0.127111
2012-02-29    0.188311
2012-03-30    0.105283
2012-04-30   -0.025970
2012-05-31   -0.010702
2012-06-29    0.010853
2012-07-31    0.001986
Freq: BM, Name: Adj Close, dtype: float64

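For this simple case (no dividends and no other adjustments to account for), the same result can also be computed from the daily percentage changes by resampling with an aggregation (here, to periods):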

In [39]:
m_rets = (1 + returns).resample('M',kind='period').prod() - 1
m_rets['2012']

Date
2012-01    0.127111
2012-02    0.188311
2012-03    0.105283
2012-04   -0.025970
2012-05   -0.010702
2012-06    0.010853
2012-07    0.001986
Freq: M, Name: Adj Close, dtype: float64
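
The agreement between the two monthly calculations can be checked on synthetic data. The sketch below is an assumption-laden stand-in: the price path is simulated, and it groups by `index.to_period('M')` instead of calling resample, a substitution made only to avoid frequency aliases that differ between pandas versions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
days = pd.date_range('2012-01-02', '2012-03-30', freq='B')
# Made-up price path; only the relationship between the two results matters
px = pd.Series(100 * np.exp(np.cumsum(rng.normal(0, 0.01, len(days)))),
               index=days)

returns = px.pct_change()
ret_index = (1 + returns).cumprod()
ret_index.iloc[0] = 1.0

months = px.index.to_period('M')

# Month-end value of the return index, then month-over-month change
from_index = ret_index.groupby(months).last().pct_change()

# Compounded daily returns within each month
from_daily = (1 + returns).groupby(months).prod() - 1

print(from_index.iloc[1:])
print(from_daily.iloc[1:])
```

The first month is excluded from the comparison because the month-over-month change has no previous month to refer to.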

If the dividend dates and payout percentages are known, they can be folded into the daily returns.
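
A minimal sketch of including dividends in the daily returns, under stated assumptions: the returns and the single dividend are made-up numbers, and `dividend_pcts` is a hypothetical Series holding each dividend as a fraction of the price, indexed by the date on which it is credited.

```python
import pandas as pd

# Made-up daily price returns for five business days
days = pd.date_range('2012-06-11', periods=5, freq='B')
returns = pd.Series([0.010, -0.005, 0.002, 0.000, 0.004], index=days)

# Hypothetical dividend: 0.5% of the price, credited on 2012-06-13
dividend_pcts = pd.Series([0.005], index=pd.DatetimeIndex(['2012-06-13']))

# Add the dividend percentage to the price return on the matching dates;
# dates without a dividend are left unchanged (fill_value=0)
total_returns = returns.add(dividend_pcts, fill_value=0)

print(total_returns)
```

Using `add` with `fill_value=0` relies on index alignment, so dividend dates do not need to be looked up by hand.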

## 11.2 Group Transforms and Analysis

Take a hypothetical stock portfolio as an example. First, randomly generate 1000 ticker symbols:

In [66]:
import random;random.seed(0)
import string
N = 1000
def rands(n):
    choices = string.ascii_uppercase # the uppercase alphabet
    # 'sep'.join(seq) concatenates the elements of seq into one string, separated by sep
    #help(random.choice)
    return ''.join([random.choice(choices) for _ in range(n)]) # range replaces Python 2's xrange, which does not exist in Python 3
tickers = np.array([rands(5) for _ in range(N)])
tickers[:5]

array(['VTKGN', 'KUHMP', 'XNHTQ', 'GXZVX', 'ISXRM'],
      dtype='|S5')

Next, create a DataFrame with three columns to hold the hypothetical data, using only a subset of the tickers for the portfolio:

In [78]:
M = 500
df = pd.DataFrame({'Momentum' : np.random.randn(M)/200 + 0.03,
                  'Value' : np.random.randn(M)/200 + 0.08,
                  'ShortInterest' : np.random.randn(M)/200 - 0.02},
                 index = tickers[:M])
df[:5]

,Momentum,ShortInterest,Value
VTKGN,0.027625,-0.025925,0.082719
KUHMP,0.028375,-0.012996,0.080333
XNHTQ,0.030507,-0.022311,0.083494
GXZVX,0.034514,-0.024332,0.083852
ISXRM,0.031657,-0.028001,0.076246


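Next, create a random industry classification for the tickers. To keep things simple, only two industries are used, and the mapping is stored in a Series: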

In [84]:
ind_names = np.array(['FINANCIAL','TECH'])
sampler = np.random.randint(0,len(ind_names),N)
industries = pd.Series(ind_names[sampler],index=tickers,name = 'industry')
industries[:5]

VTKGN         TECH
KUHMP         TECH
XNHTQ    FINANCIAL
GXZVX    FINANCIAL
ISXRM         TECH
Name: industry, dtype: object

Now group by industry and carry out group aggregations and transformations:

In [86]:
by_industry = df.groupby(industries)
by_industry.mean()

industry,Momentum,ShortInterest,Value
FINANCIAL,0.02982,-0.020313,0.080167
TECH,0.030029,-0.019565,0.079697


In [88]:
by_industry['Momentum'].describe()

industry,count,mean,std,min,25%,50%,75%,max
FINANCIAL,251.0,0.02982,0.005618,0.015774,0.02614,0.030012,0.033277,0.046707
TECH,249.0,0.030029,0.005002,0.019303,0.026057,0.029874,0.033484,0.044536


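To apply various transformations to these industry-grouped portfolios, you can write custom transformation functions. One example is standardizing within industry, which is widely used in equity portfolio construction:

In [102]:
# Standardize within industry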
def zscore(group):
    return (group - group.mean()) / group.std()

df_stand = by_industry.apply(zscore)
df_stand[:5]

,Momentum,ShortInterest,Value
VTKGN,-0.480673,-1.295103,0.578231
KUHMP,-0.330602,1.337638,0.121551
XNHTQ,0.122339,-0.38725,0.656809
GXZVX,0.835547,-0.778984,0.727591
ISXRM,0.325371,-1.717876,-0.660431


After this step, each industry has mean 0 and standard deviation 1:

In [103]:
df_stand.groupby(industries).agg(['mean','std']) # apply several functions column-wise

,Momentum,Momentum,ShortInterest,ShortInterest,Value,Value
industry,mean,std,mean,std,mean,std
FINANCIAL,-9.13833e-16,1.0,9.315258e-16,1.0,9.953083e-15,1.0
TECH,-5.675959e-16,1.0,7.633341e-16,1.0,4.695931e-15,1.0


Built-in transformations such as rank can be used more concisely:

In [104]:
# Rank within industry in descending order
ind_rank = by_industry.rank(ascending=False)
ind_rank.groupby(industries).agg(['min','max'])

,Momentum,Momentum,ShortInterest,ShortInterest,Value,Value
industry,min,max,min,max,min,max
FINANCIAL,1.0,251.0,1.0,251.0,1.0,251.0
TECH,1.0,249.0,1.0,249.0,1.0,249.0


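In quantitative equity analysis, "rank and standardize" is a common sequence of transformations. Chaining rank and zscore together carries out the whole transformation:

In [109]:
# Rank and standardize within industry
by_industry.apply(lambda x:zscore(x.rank()))[:10]

,Momentum,ShortInterest,Value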
VTKGN,-0.555368,-1.360652,0.749747
KUHMP,-0.416526,1.416189,0.249916
XNHTQ,0.206607,-0.50963,0.812654
GXZVX,1.101903,-0.977939,0.881523
ISXRM,0.444294,-1.568915,-0.846936
CLPXZ,0.569252,1.180157,-0.013884
MWGUO,1.207926,-1.555031,-1.152389
ASKVR,-0.702463,-0.550952,-1.046808
AMWGI,-1.404927,-0.674916,0.055095
WEOGZ,0.541484,-1.305115,0.763631


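### 11.2.1 Group Factor Exposures
Factor analysis is a technique in quantitative portfolio management. Portfolio holdings and performance (profit and loss) are decomposed using one or more factors (risk factors are one example), represented as a portfolio of weights. For example, a stock price's co-movement with a benchmark (such as the S&P 500 index) is known as its beta, a common risk factor. Consider a contrived portfolio built from three randomly generated factors (usually called the factor loadings) and some weights: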

In [110]:
from numpy.random import rand
fac1,fac2,fac3 = np.random.rand(3,1000)
ticker_subset = tickers.take(np.random.permutation(N)[:1000])
ticker_subset[:20]

array(['GPSUE', 'OHESH', 'MUKPW', 'GSGGY', 'ITFQZ', 'HKWNO', 'IEKXL',
       'MFBXC', 'FFYPB', 'HVRUB', 'VTKGN', 'XZREU', 'WBFBA', 'EVOIO',
       'ZMLDR', 'FNNVN', 'GTXLR', 'PADIU', 'PIXCU', 'YUEKT'],
      dtype='|S5')

In [113]:
# Weighted sum of the factors plus noise
port= pd.Series(0.7 * fac1 - 1.2 * fac2 + 0.3 * fac3 +rand(1000),
               index = ticker_subset)
factors = pd.DataFrame({'f1':fac1,'f2':fac2,'f3':fac3},index = ticker_subset)

Vector correlations between each factor and the portfolio may not tell you much:

In [115]:
factors.corrwith(port)

f1    0.390998
f2   -0.699533
f3    0.162684
dtype: float64

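The standard way to compute factor exposures is least squares regression. In old pandas versions, pandas.ols with factors as the explanatory variables computes the exposures of the whole portfolio; that function has since been removed from pandas, so the cells here use statsmodels instead. Note that sm.OLS does not add an intercept unless a constant column is supplied, whereas the formula interface includes one by default: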

In [130]:
import statsmodels.api as sm
model = sm.OLS(port,factors)
results = model.fit()
results.summary()

Dep. Variable:,y,R-squared:,0.736
Model:,OLS,Adj. R-squared:,0.736
Method:,Least Squares,F-statistic:,928.1
Date:,"Thu, 21 Dec 2017",Prob (F-statistic):,5.37e-288
Time:,13:37:21,Log-Likelihood:,-313.27
No. Observations:,1000,AIC:,632.5
Df Residuals:,997,BIC:,647.3
Df Model:,3,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
f1,1.0331,0.031,33.308,0.000,0.972,1.094
f2,-0.9413,0.031,-30.775,0.000,-1.001,-0.881
f3,0.6171,0.031,20.068,0.000,0.557,0.677

Omnibus:,44.079,Durbin-Watson:,1.961
Prob(Omnibus):,0.0,Jarque-Bera (JB):,19.26
Skew:,-0.063,Prob(JB):,6.57e-05
Kurtosis:,2.332,Cond. No.,3.27
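
The effect of leaving out the intercept can be reproduced with plain NumPy. The sketch below is a stand-in for the statsmodels fit, not the same code: it simulates its own factors with a fixed seed and solves the least squares problem with `np.linalg.lstsq`, once without and once with a column of ones. In statsmodels, the constant column is usually added with `sm.add_constant`.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Three made-up factors and a portfolio built like `port` above:
# a weighted sum of the factors plus uniform noise on [0, 1)
f = rng.random((n, 3))
y = f @ np.array([0.7, -1.2, 0.3]) + rng.random(n)

# No intercept: the noise mean (about 0.5) leaks into the coefficients
coef_no_const, *_ = np.linalg.lstsq(f, y, rcond=None)

# With a column of ones, the intercept absorbs the noise mean
X = np.column_stack([np.ones(n), f])
coef_const, *_ = np.linalg.lstsq(X, y, rcond=None)

print(coef_no_const.round(2))
print(coef_const.round(2))
```

With the constant included, the last three coefficients should land close to the weights 0.7, -1.2 and 0.3 used to build the portfolio.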


In [133]:
from statsmodels.formula.api import ols

In [142]:
ols('port~factors',factors).fit().summary()

Dep. Variable:,port,R-squared:,0.686
Model:,OLS,Adj. R-squared:,0.685
Method:,Least Squares,F-statistic:,725.6
Date:,"Thu, 21 Dec 2017",Prob (F-statistic):,5.50e-250
Time:,13:44:20,Log-Likelihood:,-162.94
No. Observations:,1000,AIC:,333.9
Df Residuals:,996,BIC:,353.5
Df Model:,3,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,0.5332,0.029,18.691,0.000,0.477,0.589
factors[0],0.7213,0.031,22.912,0.000,0.660,0.783
factors[1],-1.2592,0.031,-40.171,0.000,-1.321,-1.198
factors[2],0.3020,0.031,9.622,0.000,0.240,0.364

Omnibus:,574.476,Durbin-Watson:,1.987
Prob(Omnibus):,0.0,Jarque-Bera (JB):,57.027
Skew:,-0.04,Prob(JB):,4.14e-13
Kurtosis:,1.833,Cond. No.,6.18


In [135]:
port

GPSUE    0.311791
OHESH    0.314354
MUKPW   -0.299662
GSGGY   -0.715172
ITFQZ    0.894466
HKWNO    0.425192
IEKXL   -0.139556
MFBXC    0.801049
FFYPB    0.577301
HVRUB    1.174565
VTKGN   -0.120595
XZREU   -0.182414
WBFBA    0.077748
EVOIO   -0.464377
ZMLDR    0.989816
FNNVN    0.457903
GTXLR    1.341780
PADIU    0.328624
PIXCU    0.001301
YUEKT    0.646775
CMFKN    0.059372
QVFLE   -0.047544
AZDQM   -0.270544
QQSKK    0.137381
BXRTR   -0.571552
BLRAE    0.321411
WGPSV    0.306594
BUDVY    1.012109
XYXNW    0.442623
YFAWP    0.382938
           ...   
PTGHL    1.082649
PAYNN   -0.130292
HCLOX   -0.462041
WRWQF    0.921025
ZSSRZ    1.013411
QJEWY   -0.342299
LHNJU    0.952866
JRWTQ    0.066160
DYDFY    0.720880
XHFRG    0.550150
UTPFR    1.177658
OOLLE    0.243878
CBYGS    0.256092
ICUJG    0.461665
QXCVN    1.459588
TQKDF    0.645449
YAHRO   -0.167446
QLDPG    0.516153
JLESE    0.549367
INISX    0.942135
HHRYP    0.727417
WPECZ    0.466517
GBEAR    1.008052
ONSCX    0.196916
LCAIC    0