Theoretical background:

Fama-French three-factor model:
$$
R_{i} - R_{f} = \alpha_{i} + \beta_{1i}(R_{m} - R_{f}) + \beta_{2i}SMB + \beta_{3i}HML + \epsilon_{i}
$$

Fama-French five-factor model:
$$
R_{i} - R_{f}= \alpha_{i} + \beta_{1i}(R_{m} - R_{f}) + \beta_{2i}SMB + \beta_{3i}HML + \beta_{4i}RMW + \beta_{5i}CMA + \epsilon_{i}
$$
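
The loadings in the two models above are typically estimated with a time-series OLS regression of an asset's excess return on the factor returns. The following is a minimal sketch on simulated data for the three-factor case; the factor series, the "true" loadings, and all variable names are made up for illustration and are not taken from the dataset used later on.

```python
import numpy as np

rng = np.random.default_rng(0)
n_months = 600

# Simulated monthly factor returns: columns are MKT-RF, SMB, HML (illustrative only)
factors = rng.normal(loc=0.0, scale=0.04, size=(n_months, 3))

# Hypothetical true parameters of one stock
true_alpha = 0.001
true_betas = np.array([1.1, 0.4, -0.3])

# Excess return generated by the three-factor equation plus noise
excess_ret = true_alpha + factors @ true_betas + rng.normal(0.0, 0.01, n_months)

# OLS: regress the excess return on a constant and the three factors
X = np.column_stack([np.ones(n_months), factors])
coef, *_ = np.linalg.lstsq(X, excess_ret, rcond=None)
alpha_hat, betas_hat = coef[0], coef[1:]

print("alpha:", round(float(alpha_hat), 4))
print("betas:", np.round(betas_hat, 2))
```

With enough months the estimated loadings should land close to the values used in the simulation; the five-factor case only adds two more columns to `factors`.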

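- MKT portfolio (mimicking the market factor):  
Most empirical studies treat the market portfolio as the set of securities traded on US stock exchanges.  
Because the market factor is already widely accepted and used in the form of a market index, there is no need to replicate market risk with a factor-replication strategy.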

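- SMB portfolio (mimicking the size factor):  
Universe: all US common stocks in the CRSP database  
Portfolio formation date: June of each year (denoted t)  
Sorting variable: market capitalization, taken directly from the database  
Breakpoint: median market capitalization of NYSE stocks  
SMB return = return of the small-cap portfolio - return of the large-cap portfolio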

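- HML portfolio (mimicking the value factor):  
Universe: all US common stocks in the CRSP database  
Portfolio formation date: June of each year (denoted t)  
Sorting variable: book-to-market ratio = book value of common equity at the fiscal year-end of year t-1 / market capitalization at the end of December of year t-1  
Breakpoints: 30th and 70th percentiles of the book-to-market ratio of NYSE stocks  
HML return = return of the high book-to-market portfolio - return of the low book-to-market portfolio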

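- RMW portfolio (mimicking the profitability factor):  
Universe: all US common stocks in the CRSP database  
Portfolio formation date: June of each year (denoted t)  
Sorting variable: operating profitability = (revenues - cost of goods sold - selling, general and administrative expenses - interest expense) for the fiscal year ending in t-1 / book equity at the fiscal year-end of year t-1 (provided directly in the data table as `op`)  
Breakpoints: 30th and 70th percentiles of the operating profitability of NYSE stocks  
RMW return = return of the high-profitability portfolio - return of the low-profitability portfolio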

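- CMA portfolio (mimicking the investment factor):  
Universe: all US common stocks in the CRSP database  
Portfolio formation date: June of each year (denoted t)  
Sorting variable: investment ratio = growth in total assets, i.e. the change in total assets from the fiscal year ending in t-2 to the fiscal year ending in t-1, divided by total assets at the end of t-2 (provided directly in the data table as `inv`)  
Breakpoints: 30th and 70th percentiles of the total-asset growth of NYSE stocks  
CMA return = return of the low-asset-growth portfolio - return of the high-asset-growth portfolio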
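
As a small worked example of the long-short definitions above, the sketch below computes SMB and HML from six size/book-to-market portfolios, averaging across the second sorting dimension as the replication code further down does. The return numbers and the helper `mean` are hypothetical; RMW and CMA follow the same pattern with the profitability and investment groups.

```python
# Hypothetical monthly returns of the six size/book-to-market portfolios
# (S = small, B = big; H/M/L = high/medium/low book-to-market)
ret = {
    ("S", "H"): 0.020, ("S", "M"): 0.015, ("S", "L"): 0.010,
    ("B", "H"): 0.012, ("B", "M"): 0.009, ("B", "L"): 0.006,
}

def mean(values):
    # Plain arithmetic mean of an iterable
    values = list(values)
    return sum(values) / len(values)

# SMB: average of the three small portfolios minus average of the three big ones
smb = (mean(r for (size, _), r in ret.items() if size == "S")
       - mean(r for (size, _), r in ret.items() if size == "B"))

# HML: average of the two high-B/M portfolios minus average of the two low-B/M ones
hml = (mean(r for (_, bm), r in ret.items() if bm == "H")
       - mean(r for (_, bm), r in ret.items() if bm == "L"))

print(f"SMB = {smb:.4f}, HML = {hml:.4f}")
```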

Code implementation:

I. Preparation

1. Importing libraries

In [93]:
import pandas as pd
import numpy as np
import sqlite3
import statsmodels.formula.api as smf        #lets you specify statistical models with R-style formula syntax
from regtabletotext import prettify_result

2. Loading the data:
  Read the tables from the SQLite database, select the required columns, and parse the date/datadate columns as datetimes

In [94]:
tidy_finance = sqlite3.connect(
  database=r"C:\Users\86189\Desktop\For Students\tidy_finance.sqlite"
)

crsp_monthly = (pd.read_sql_query(
    sql=("SELECT permno, gvkey, date, ret_excess, mktcap, mktcap_lag, exchange FROM crsp_monthly"),  #excess return | market cap | lagged market cap
    con=tidy_finance,
    parse_dates={"date"})
)

compustat = (pd.read_sql_query(
    sql="SELECT gvkey, datadate, be, op, inv FROM compustat",    #book equity | profitability | investment
    con=tidy_finance,
    parse_dates={"datadate"})
)

factors_ff3_monthly = (pd.read_sql_query(
    sql="SELECT date, smb, hml FROM factors_ff3_monthly",        #SMB factor | HML factor
    con=tidy_finance,
    parse_dates={"date"}
))

factors_ff5_monthly = (pd.read_sql_query(
    sql=("SELECT date, smb, hml, rmw, cma "                      #SMB | HML | RMW | CMA factors
         "FROM factors_ff5_monthly"),
    con=tidy_finance,
    parse_dates={"date"}
))

crsp_monthly


index,permno,gvkey,date,ret_excess,mktcap,mktcap_lag,exchange
0,10028,012096,1993-03-01,-0.102500,6.329250,7.032500,AMEX
1,10028,012096,1993-04-01,0.386489,8.790625,6.329250,AMEX
2,10028,012096,1993-05-01,0.197800,10.548750,8.790625,AMEX
3,10028,012096,1993-06-01,-0.135833,9.044750,10.548750,AMEX
4,10028,012096,1993-07-01,0.189908,10.784125,9.044750,AMEX
...,...,...,...,...,...,...,...
3326348,10042,012139,2005-02-01,-0.215192,23.583960,29.989479,AMEX
3326349,10042,012139,2005-03-01,-0.113211,21.029761,23.583960,AMEX
3326350,10042,012139,2005-04-01,-0.071544,19.569360,21.029761,AMEX
3326351,10042,012139,2005-05-01,0.102078,25.538140,19.569360,AMEX


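3. Merging the datasets

Fama and French form the size and book-to-market portfolios in a specific way  
1. Computing the book-to-market ratio:  
   - Fama and French use the market equity at the end of year t-1 and the book equity reported for year t-1  
   - Market equity and book equity therefore do not necessarily refer to the same point in time  
   - The effect of this timing mismatch on the empirical results is negligible  
   - It is enough to extract the year from the `datadate` column (the fiscal year-end date) of compustat
   - The other sorting variables are likewise taken from year t-1 

2. Implementation with sorting_date:  
   - A temporary `sorting_date` column handles the time lag.
   - The goal is to keep exactly one observation per stock per year.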
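
The timing rule can be checked on a toy table before running it on the full data. The sketch below maps each fiscal year-end to the July 1 of the following year; the table `compustat_toy` and its values are hypothetical.

```python
import pandas as pd

# Toy accounting records (hypothetical values): one firm, two fiscal year-ends
compustat_toy = pd.DataFrame({
    "gvkey": ["000001", "000001"],
    "datadate": pd.to_datetime(["1992-09-30", "1993-12-31"]),
    "be": [10.0, 12.0],
})

# Accounting data for a fiscal year ending in year t-1 is used for the sort on July 1 of year t
compustat_toy["sorting_date"] = pd.to_datetime(
    (compustat_toy["datadate"].dt.year + 1).astype(str) + "-07-01"
)

# Each (gvkey, sorting_date) pair should occur only once
has_duplicates = compustat_toy.duplicated(["gvkey", "sorting_date"]).any()

print(compustat_toy[["datadate", "sorting_date"]])
print("duplicates:", has_duplicates)
```

Note that a fiscal year ending in September 1992 and one ending in December 1992 would both map to July 1993, which is why the year of `datadate` is all that is needed.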

In [95]:
size = (crsp_monthly[crsp_monthly["date"].dt.month == 6]
    .assign(sorting_date = lambda x: x["date"] + pd.DateOffset(months=1))
    .rename(columns={"mktcap" : "size"})
    .get(["permno","exchange","sorting_date","size"])
)#June market cap is the basis for the size sort

market_equity = (crsp_monthly[crsp_monthly["date"].dt.month == 12]
    .assign(sorting_date = lambda x: pd.to_datetime((x["date"] + pd.DateOffset(years=1)).dt.year.astype(str) + "-" +  "07-01"))
    .rename(columns={"mktcap" : "me"})
    .get(["permno","gvkey","sorting_date","me"]) 
)#market cap at the end of December

book_to_market =(compustat
    .assign(sorting_date = lambda x: pd.to_datetime((x["datadate"].dt.year+1).astype(str) + "-" +  "07-01"))
    .get(["gvkey","sorting_date","be"])  #book equity at the fiscal year-end
    .merge(market_equity,how="inner",on=["gvkey","sorting_date"])
    .assign(bm= lambda x:x["be"]/x["me"])#book-to-market ratio
    .get(["permno","sorting_date","me","bm"])
)#the book-to-market ratio is the basis for the value sort

sorting_variables = (size
    .merge(book_to_market,how="inner",on=["permno","sorting_date"])     #merge the size and book_to_market tables
    .dropna()                                                           #drop every row that contains a missing value (NaN)
    .drop_duplicates(subset=["permno","sorting_date"])                  #keep one row per stock and sorting date
)

sorting_variables

index,permno,exchange,sorting_date,size,me,bm
0,10028,AMEX,1993-07-01,9.044750,7.735750,0.104967
1,10028,AMEX,1994-07-01,13.209750,13.567125,0.119922
2,10028,AMEX,1995-07-01,9.192187,13.126500,0.133699
3,10028,AMEX,1996-07-01,8.367688,7.287500,0.243431
4,10028,AMEX,1997-07-01,7.735000,6.158250,0.345228
...,...,...,...,...,...,...
242271,10042,AMEX,2001-07-01,14.687200,11.624000,2.952168
242272,10042,AMEX,2002-07-01,39.130002,14.834060,2.249553
242273,10042,AMEX,2003-07-01,28.366201,22.565760,1.174124
242274,10042,AMEX,2004-07-01,108.659373,96.315269,0.235622


II. Portfolio sorts

In [96]:
def assign_portfolio(data, sorting_variable, percentiles):   
    breakpoints = (data                                      #data: the observations to assign to portfolios
      .query("exchange == 'NYSE'")                           #breakpoints are computed from NYSE stocks only
      .get(sorting_variable)                                 #sorting_variable: the column to sort on (market cap / book-to-market)
      .quantile(percentiles, interpolation="linear")         #percentiles: the percentiles that define the portfolio breakpoints
    )
    breakpoints.iloc[0] = -np.inf
    breakpoints.iloc[breakpoints.size-1] = np.inf
    #set the first breakpoint to -inf and the last one to +inf so that every observation falls into a bin
    assigned_portfolios = pd.cut(                            #pd.cut assigns each observation to a portfolio based on the breakpoints
      data[sorting_variable],
      bins=breakpoints,
      labels=pd.Series(range(1, breakpoints.size)),          #portfolio labels run from 1 to breakpoints.size-1
      include_lowest=True,
      right=False
    )
    
    return assigned_portfolios
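
To see what the function returns, the sketch below applies it to a toy cross-section. It restates the function in compact form so that the snippet runs on its own (written with `np.inf`, the spelling that works across NumPy versions); the table `toy` and its values are hypothetical.

```python
import numpy as np
import pandas as pd

def assign_portfolio(data, sorting_variable, percentiles):
    # Compact restatement of the function above so that this snippet is self-contained
    breakpoints = (data
        .query("exchange == 'NYSE'")
        .get(sorting_variable)
        .quantile(percentiles, interpolation="linear")
    )
    breakpoints.iloc[0] = -np.inf
    breakpoints.iloc[-1] = np.inf
    return pd.cut(
        data[sorting_variable],
        bins=breakpoints.to_numpy(),
        labels=list(range(1, breakpoints.size)),
        include_lowest=True,
        right=False,
    )

# Five NYSE stocks define the median; two NASDAQ stocks are sorted with the same breakpoint
toy = pd.DataFrame({
    "exchange": ["NYSE"] * 5 + ["NASDAQ"] * 2,
    "size": [1.0, 2.0, 3.0, 4.0, 5.0, 0.5, 10.0],
})
toy["portfolio_size"] = assign_portfolio(toy, "size", [0, 0.5, 1])
print(toy)
```

The NYSE median here is 3.0, and because `right=False` makes the bins left-closed, the stock sitting exactly on the median goes into portfolio 2. Non-NYSE stocks outside the NYSE range are still assigned thanks to the infinite outer breakpoints.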

In [97]:
portfolios = (sorting_variables
    .groupby(["sorting_date"],group_keys=False)
    .apply(lambda x: x.assign(
        portfolio_size = assign_portfolio(data=x,sorting_variable="size",percentiles=[0,0.5,1]),
        portfolio_bm = assign_portfolio(data=x,sorting_variable="bm",percentiles=[0,0.3,0.7,1])       
        )
    )
    .get(["permno","sorting_date","portfolio_size","portfolio_bm"])
)
portfolios   
#every stock now carries its labels: size (1 = S, 2 = B) | book-to-market (1 = L, 2 = M, 3 = H)


index,permno,sorting_date,portfolio_size,portfolio_bm
0,10028,1993-07-01,1,1
1,10028,1994-07-01,1,1
2,10028,1995-07-01,1,1
3,10028,1996-07-01,1,1
4,10028,1997-07-01,1,1
...,...,...,...,...
242271,10042,2001-07-01,1,3
242272,10042,2002-07-01,1,3
242273,10042,2003-07-01,1,3
242274,10042,2004-07-01,1,1


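- Portfolios are formed in June of each year t, and the July return is the first monthly return of the corresponding portfolio, so the holding period runs from July of year t to June of year t+1  
Step 1: build `sorting_date`: if the month of `date` falls in the first half of the year, sorting_date is July 1 of the previous year; otherwise it is July 1 of the current year  
Step 2: merge the crsp_monthly and portfolios tables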
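
Step 1 can be verified on a handful of dates around the June/July switch. This is a minimal sketch with made-up dates; it computes the formation year arithmetically instead of with `np.where`, which is an equivalent formulation rather than the code used in the next cell.

```python
import pandas as pd

# A few return months around the June/July switch (illustrative dates)
dates = pd.Series(pd.to_datetime(["1993-05-01", "1993-06-01", "1993-07-01", "1994-01-01"]))

# Months January to June still belong to the portfolio formed in the previous year
formation_year = dates.dt.year - (dates.dt.month <= 6).astype(int)
sorting_date = pd.to_datetime(formation_year.astype(str) + "-07-01")

print(pd.DataFrame({"date": dates, "sorting_date": sorting_date}))
```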


In [98]:
portfolios=(crsp_monthly
    .assign(sorting_date = lambda x: np.where(x["date"].dt.month <= 6,pd.to_datetime((x["date"].dt.year-1).astype(str) + "-" +  "07-01"),
                                              pd.to_datetime((x["date"].dt.year).astype(str) + "-" +  "07-01")))
    .merge(portfolios,how="inner",on=["permno","sorting_date"])
)
portfolios

index,permno,gvkey,date,ret_excess,mktcap,mktcap_lag,exchange,sorting_date,portfolio_size,portfolio_bm
0,10028,012096,1993-03-01,-0.102500,6.329250,7.032500,AMEX,1992-07-01,1,1
1,10028,012096,1993-04-01,0.386489,8.790625,6.329250,AMEX,1992-07-01,1,1
2,10028,012096,1993-05-01,0.197800,10.548750,8.790625,AMEX,1992-07-01,1,1
3,10028,012096,1993-06-01,-0.135833,9.044750,10.548750,AMEX,1992-07-01,1,1
4,10028,012096,1993-07-01,0.189908,10.784125,9.044750,AMEX,1993-07-01,1,1
...,...,...,...,...,...,...,...,...,...,...
2628353,10042,012139,2005-02-01,-0.215192,23.583960,29.989479,AMEX,2004-07-01,1,1
2628354,10042,012139,2005-03-01,-0.113211,21.029761,23.583960,AMEX,2004-07-01,1,1
2628355,10042,012139,2005-04-01,-0.071544,19.569360,21.029761,AMEX,2004-07-01,1,1
2628356,10042,012139,2005-05-01,0.102078,25.538140,19.569360,AMEX,2004-07-01,1,1


III. Fama-French three-factor model

1. Factor replication

- Double-sorted portfolios

In [99]:
factors_replicated =(portfolios
    .groupby(["portfolio_size","portfolio_bm","date"])#group by the 2x3 labels (S/H, S/M, S/L | B/H, B/M, B/L) and by month
    .apply(lambda x:np.average(x["ret_excess"],weights=x["mktcap_lag"]))
    #value-weighted average excess return of each group, weighted by lagged market cap
    #the weight is the market cap at the end of the previous month, which is already known at the start of the month, so the weights use no information from the return being averaged
    .reset_index(name="ret")#name the value-weighted return column ret
    .groupby("date",group_keys=False)#group the results again, this time by date
    .apply(lambda x:pd.Series(
        {"smb_replicated" : ((x["ret"][x["portfolio_size"] == 1]).mean() - (x["ret"][x["portfolio_size"] == 2]).mean()),
        "hml_replicated" : ((x["ret"][x["portfolio_bm"] == 3]).mean() - (x["ret"][x["portfolio_bm"] == 1]).mean())}))
    #apply the SMB and HML return formulas: small minus big, high minus low
)
factors_replicated
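
The inner aggregation in the cell above is a value-weighted mean. The sketch below shows, for two hypothetical stocks, how it differs from an equal-weighted mean; the numbers are made up.

```python
import numpy as np

# Two hypothetical stocks in one portfolio and one month
ret_excess = np.array([0.10, -0.02])   # excess returns in the month
mktcap_lag = np.array([1.0, 3.0])      # market caps at the end of the previous month

# Value-weighted return: sum(w_i * r_i) / sum(w_i)
vw_ret = np.average(ret_excess, weights=mktcap_lag)

# Equal-weighted return for comparison
ew_ret = ret_excess.mean()

print(f"value-weighted: {vw_ret:.4f}, equal-weighted: {ew_ret:.4f}")
```

The larger stock dominates the value-weighted figure, which is why small, volatile stocks have limited influence on the replicated factors.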


date,smb_replicated,hml_replicated
1961-07-01,-0.017210,-0.003597
1961-08-01,0.000980,-0.022819
1961-09-01,-0.004308,-0.020638
1961-10-01,-0.012085,0.004427
1961-11-01,-0.000050,0.004789
...,...,...
2022-08-01,0.013347,0.008983
2022-09-01,-0.007915,-0.004019
2022-10-01,-0.000423,0.078149
2022-11-01,-0.039395,0.015106


In [100]:
test =(factors_ff3_monthly
    .merge(factors_replicated,how="inner",on=["date"])
    .assign(smb_replicated =lambda x: x["smb_replicated"].round(4),
            hml_replicated =lambda x: x["hml_replicated"].round(4))
)
test

index,date,smb,hml,smb_replicated,hml_replicated
0,1961-07-01,-0.0190,-0.0009,-0.0172,-0.0036
1,1961-08-01,-0.0175,-0.0028,0.0010,-0.0228
2,1961-09-01,-0.0107,-0.0061,-0.0043,-0.0206
3,1961-10-01,-0.0165,0.0015,-0.0121,0.0044
4,1961-11-01,0.0126,-0.0123,-0.0000,0.0048
...,...,...,...,...,...
733,2022-08-01,0.0140,0.0029,0.0133,0.0090
734,2022-09-01,-0.0081,0.0005,-0.0079,-0.0040
735,2022-10-01,0.0006,0.0801,-0.0004,0.0781
736,2022-11-01,-0.0352,0.0138,-0.0394,0.0151


2. Evaluating the replication

In [101]:
model_smb = (smf.ols(
    formula="smb ~ smb_replicated", 
    data=test
  )
  .fit()
)
prettify_result(model_smb)

model_hml = (smf.ols(
    formula="hml ~ hml_replicated", 
    data=test
  )
  .fit()
)
prettify_result(model_hml)

OLS Model:
smb ~ smb_replicated

Coefficients:
                Estimate  Std. Error  t-Statistic  p-Value
Intercept         -0.000       0.000       -0.996     0.32
smb_replicated     0.993       0.004      229.091     0.00

Summary statistics:
- Number of observations: 738
- R-squared: 0.986, Adjusted R-squared: 0.986
- F-statistic: 52,482.523 on 1 and 736 DF, p-value: 0.000

OLS Model:
hml ~ hml_replicated

Coefficients:
                Estimate  Std. Error  t-Statistic  p-Value
Intercept          0.000       0.000        1.330    0.184
hml_replicated     0.963       0.007      132.345    0.000

Summary statistics:
- Number of observations: 738
- R-squared: 0.960, Adjusted R-squared: 0.960
- F-statistic: 17,515.273 on 1 and 736 DF, p-value: 0.000
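
A common way to read regressions like the ones above is to look for an intercept close to 0, a slope close to 1, and a high R-squared. The sketch below computes these three numbers with NumPy on simulated series; `replicated` and `original` are stand-ins generated for illustration, not the factor data from this notebook.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated stand-ins for a replicated factor and the published factor
replicated = rng.normal(0.0, 0.03, 500)
original = replicated + rng.normal(0.0, 0.003, 500)   # small replication error

# OLS of the original factor on the replicated one
slope, intercept = np.polyfit(replicated, original, deg=1)

# R-squared = 1 - SSR / SST (residuals have mean zero because an intercept is fitted)
residuals = original - (intercept + slope * replicated)
r_squared = 1 - residuals.var() / original.var()

print(f"intercept={intercept:.4f}, slope={slope:.3f}, R2={r_squared:.3f}")
```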



IV. Fama-French five-factor model

1. Merging the datasets

In [102]:
other_sorting_variables = (compustat
    .assign(sorting_date = lambda x: pd.to_datetime((x["datadate"].dt.year+1).astype(str) + "-" +  "07-01"))
    .get(["gvkey","sorting_date","be","op","inv"])
    .merge(market_equity,how="inner",on=["gvkey", "sorting_date"])
    .assign(bm = lambda x: x["be"]/x["me"])
    .get(["permno","sorting_date","me","be","bm","op","inv"])
)

sorting_variables = (size
    .merge(other_sorting_variables,how="inner",on=["permno", "sorting_date"])
    .dropna()
    .drop_duplicates(subset=["permno","sorting_date"])
)
sorting_variables

index,permno,exchange,sorting_date,size,me,be,bm,op,inv
0,10028,AMEX,1993-07-01,9.044750,7.735750,0.812,0.104967,-0.125616,-0.227702
1,10028,AMEX,1994-07-01,13.209750,13.567125,1.627,0.119922,0.179471,0.667376
2,10028,AMEX,1995-07-01,9.192187,13.126500,1.755,0.133699,0.096296,0.135334
3,10028,AMEX,1996-07-01,8.367688,7.287500,1.774,0.243431,0.015220,0.103739
4,10028,AMEX,1997-07-01,7.735000,6.158250,2.126,0.345228,-0.070555,0.349720
...,...,...,...,...,...,...,...,...,...
242271,10042,AMEX,2001-07-01,14.687200,11.624000,34.316,2.952168,0.247232,-0.276691
242272,10042,AMEX,2002-07-01,39.130002,14.834060,33.370,2.249553,0.084657,-0.059062
242273,10042,AMEX,2003-07-01,28.366201,22.565760,26.495,1.174124,-0.011096,-0.236247
242274,10042,AMEX,2004-07-01,108.659373,96.315269,22.694,0.235622,-0.305852,-0.100281


2. Portfolio sorts

In [103]:
portfolios = (sorting_variables
    .groupby("sorting_date",group_keys=False)
    .apply(lambda x: x.assign(portfolio_size =  assign_portfolio(data=x,sorting_variable="size",percentiles=[0,0.5,1])))
    .groupby(["sorting_date","portfolio_size"],group_keys=False)
    .apply(lambda x: x.assign(**{f"portfolio_{col}" : assign_portfolio(data=x,sorting_variable=col,percentiles=[0,0.3,0.7,1]) for col in ["bm","op","inv"]}))
    .get(["permno","sorting_date","portfolio_size","portfolio_bm","portfolio_op","portfolio_inv"])
)
#every stock now carries its labels: size (S/B) | book-to-market (H/M/L) | profitability (R/M/W) | investment (C/M/A)
portfolios


index,permno,sorting_date,portfolio_size,portfolio_bm,portfolio_op,portfolio_inv
0,10028,1993-07-01,1,1,1,1
1,10028,1994-07-01,1,1,2,3
2,10028,1995-07-01,1,1,1,2
3,10028,1996-07-01,1,1,1,2
4,10028,1997-07-01,1,1,1,3
...,...,...,...,...,...,...
242271,10042,2001-07-01,1,3,2,1
242272,10042,2002-07-01,1,3,1,1
242273,10042,2003-07-01,1,2,1,1
242274,10042,2004-07-01,1,1,1,1


In [104]:
portfolios = (crsp_monthly
    .assign(sorting_date = lambda x: np.where(x["date"].dt.month <= 6,pd.to_datetime((x["date"].dt.year-1).astype(str) + "-07-01"),
                                              pd.to_datetime((x["date"].dt.year).astype(str) + "-07-01")))
    .merge(portfolios,how="inner",on=["permno","sorting_date"])
)

portfolios

index,permno,gvkey,date,ret_excess,mktcap,mktcap_lag,exchange,sorting_date,portfolio_size,portfolio_bm,portfolio_op,portfolio_inv
0,10028,012096,1993-03-01,-0.102500,6.329250,7.032500,AMEX,1992-07-01,1,1,1,1
1,10028,012096,1993-04-01,0.386489,8.790625,6.329250,AMEX,1992-07-01,1,1,1,1
2,10028,012096,1993-05-01,0.197800,10.548750,8.790625,AMEX,1992-07-01,1,1,1,1
3,10028,012096,1993-06-01,-0.135833,9.044750,10.548750,AMEX,1992-07-01,1,1,1,1
4,10028,012096,1993-07-01,0.189908,10.784125,9.044750,AMEX,1993-07-01,1,1,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...
2555243,10042,012139,2005-02-01,-0.215192,23.583960,29.989479,AMEX,2004-07-01,1,1,1,1
2555244,10042,012139,2005-03-01,-0.113211,21.029761,23.583960,AMEX,2004-07-01,1,1,1,1
2555245,10042,012139,2005-04-01,-0.071544,19.569360,21.029761,AMEX,2004-07-01,1,1,1,1
2555246,10042,012139,2005-05-01,0.102078,25.538140,19.569360,AMEX,2004-07-01,1,1,1,1


3. Factor replication

- Double-sorted portfolios

In [105]:
portfolios_value = (portfolios
    .groupby(["portfolio_size","portfolio_bm","date"])
    .apply(lambda x: np.average(x["ret_excess"],weights = x["mktcap_lag"]))
    .reset_index(name ="ret")
)

factors_value = (portfolios_value
    .groupby("date",group_keys = False)
    .apply(lambda x: pd.Series({"hml_replicated" : np.nanmean(x[x["portfolio_bm"] == 3]["ret"]) - np.nanmean(x[x["portfolio_bm"] == 1]["ret"])}))
)
factors_value


date,hml_replicated
1962-07-01,-0.024053
1962-08-01,-0.006577
1962-09-01,0.004991
1962-10-01,-0.002795
1962-11-01,0.009973
...,...
2022-08-01,0.013136
2022-09-01,-0.000538
2022-10-01,0.072222
2022-11-01,0.011634


In [106]:
portfolios_profitability = (portfolios
    .groupby(["portfolio_size","portfolio_op","date"])
    .apply(lambda x: np.average(x["ret_excess"],weights = x["mktcap_lag"]))
    .reset_index(name ="ret")
)

factors_profitability = (portfolios_profitability
    .groupby("date",group_keys = False)
    .apply(lambda x: pd.Series({"rmw_replicated" : np.nanmean(x[x["portfolio_op"] == 3]["ret"]) - np.nanmean(x[x["portfolio_op"] == 1]["ret"])}))
)
factors_profitability


date,rmw_replicated
1962-07-01,0.020984
1962-08-01,0.008683
1962-09-01,0.000504
1962-10-01,0.012590
1962-11-01,-0.007101
...,...
2022-08-01,-0.049411
2022-09-01,-0.009860
2022-10-01,0.042216
2022-11-01,0.061493


In [107]:
portfolios_investment = (portfolios
    .groupby(["portfolio_size","portfolio_inv","date"])
    .apply(lambda x: np.average(x["ret_excess"],weights = x["mktcap_lag"]))
    .reset_index(name ="ret")
)

factors_investment = (portfolios_investment
    .groupby("date",group_keys = False)
    .apply(lambda x: pd.Series({"cma_replicated" : np.nanmean(x[x["portfolio_inv"] == 1]["ret"]) - np.nanmean(x[x["portfolio_inv"] == 3]["ret"])}))
)
factors_investment


date,cma_replicated
1962-07-01,-0.029187
1962-08-01,0.006461
1962-09-01,0.002035
1962-10-01,0.007685
1962-11-01,-0.002213
...,...
2022-08-01,0.017325
2022-09-01,-0.005703
2022-10-01,0.069982
2022-11-01,0.032976


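$$
R_{SMB} = \frac{1}{9}\left(R_{S/H} + R_{S/M} + R_{S/L} + R_{S/R} + R_{S/M} + R_{S/W} + R_{S/C} + R_{S/M} + R_{S/A}\right)
- \frac{1}{9}\left(R_{B/H} + R_{B/M} + R_{B/L} + R_{B/R} + R_{B/M} + R_{B/W} + R_{B/C} + R_{B/M} + R_{B/A}\right)
$$

The middle group M appears three times in each bracket because the book-to-market, profitability, and investment sorts each have their own middle group.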
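
As a numerical illustration of this formula, the sketch below averages the nine small and the nine big portfolio returns from the three 2x3 sorts. All return values are hypothetical, and the dictionaries are only a stand-in for the `portfolios_value`, `portfolios_profitability`, and `portfolios_investment` tables.

```python
# Hypothetical monthly returns of the 2x3 portfolios from each of the three sorts
# keys: (size group, group of the second sort); size 1 = small, 2 = big
value = {(1, 1): 0.010, (1, 2): 0.012, (1, 3): 0.014,
         (2, 1): 0.006, (2, 2): 0.007, (2, 3): 0.008}
profitability = {(1, 1): 0.011, (1, 2): 0.012, (1, 3): 0.013,
                 (2, 1): 0.005, (2, 2): 0.007, (2, 3): 0.009}
investment = {(1, 1): 0.013, (1, 2): 0.012, (1, 3): 0.011,
              (2, 1): 0.008, (2, 2): 0.007, (2, 3): 0.006}

# Collect the nine small and the nine big portfolio returns
small, big = [], []
for sort in (value, profitability, investment):
    for (size, _), r in sort.items():
        (small if size == 1 else big).append(r)

# SMB is the difference between the two nine-portfolio averages
smb = sum(small) / len(small) - sum(big) / len(big)
print(f"SMB = {smb:.4f}")
```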

In [108]:
factors_size = (pd.concat([portfolios_value, portfolios_profitability, portfolios_investment], ignore_index=True)
                .groupby("date",group_keys=False)
                .apply(lambda x: pd.Series({"smb_replicated": np.nanmean(x[x["portfolio_size"] == 1]["ret"]) - np.nanmean(x[x["portfolio_size"] == 2]["ret"])}))
)
factors_size


date,smb_replicated
1962-07-01,-0.004281
1962-08-01,0.003765
1962-09-01,-0.012423
1962-10-01,-0.026269
1962-11-01,0.020659
...,...
2022-08-01,0.016857
2022-09-01,-0.010323
2022-10-01,0.018472
2022-11-01,-0.032135


In [109]:
factors_replicated = (factors_size
    .merge(factors_value,how="outer",on=["date"])
    .merge(factors_profitability,how="outer",on=["date"])
    .merge(factors_investment,how="outer",on=["date"])
)
factors_replicated

date,smb_replicated,hml_replicated,rmw_replicated,cma_replicated
1962-07-01,-0.004281,-0.024053,0.020984,-0.029187
1962-08-01,0.003765,-0.006577,0.008683,0.006461
1962-09-01,-0.012423,0.004991,0.000504,0.002035
1962-10-01,-0.026269,-0.002795,0.012590,0.007685
1962-11-01,0.020659,0.009973,-0.007101,-0.002213
...,...,...,...,...
2022-08-01,0.016857,0.013136,-0.049411,0.017325
2022-09-01,-0.010323,-0.000538,-0.009860,-0.005703
2022-10-01,0.018472,0.072222,0.042216,0.069982
2022-11-01,-0.032135,0.011634,0.061493,0.032976


In [110]:
test = (factors_ff5_monthly
    .merge(factors_replicated,how="inner",on=["date"])
    .assign(smb_replicated =lambda x: x["smb_replicated"].round(4),
            hml_replicated =lambda x: x["hml_replicated"].round(4),
            rmw_replicated =lambda x: x["rmw_replicated"].round(4),
            cma_replicated =lambda x: x["cma_replicated"].round(4))
)
test

index,date,smb,hml,rmw,cma,smb_replicated,hml_replicated,rmw_replicated,cma_replicated
0,1963-07-01,-0.0041,-0.0097,0.0068,-0.0118,-0.0146,-0.0006,0.0006,-0.0096
1,1963-08-01,-0.0080,0.0180,0.0036,-0.0035,-0.0030,0.0046,0.0022,0.0039
2,1963-09-01,-0.0052,0.0013,-0.0071,0.0029,-0.0078,0.0051,-0.0095,-0.0019
3,1963-10-01,-0.0139,-0.0010,0.0280,-0.0201,-0.0079,-0.0151,0.0291,-0.0215
4,1963-11-01,-0.0088,0.0175,-0.0051,0.0224,-0.0069,0.0084,-0.0057,0.0129
...,...,...,...,...,...,...,...,...,...
709,2022-08-01,0.0152,0.0029,-0.0475,0.0129,0.0169,0.0131,-0.0494,0.0173
710,2022-09-01,-0.0105,0.0005,-0.0151,-0.0080,-0.0103,-0.0005,-0.0099,-0.0057
711,2022-10-01,0.0189,0.0801,0.0334,0.0664,0.0185,0.0722,0.0422,0.0700
712,2022-11-01,-0.0274,0.0138,0.0638,0.0318,-0.0321,0.0116,0.0615,0.0330


4. Evaluating the replication

In [111]:
model_smb = (smf.ols(
    formula="smb ~ smb_replicated", 
    data=test
  )
  .fit()
)
prettify_result(model_smb)

OLS Model:
smb ~ smb_replicated

Coefficients:
                Estimate  Std. Error  t-Statistic  p-Value
Intercept          -0.00       0.000       -1.495    0.135
smb_replicated      0.97       0.004      221.609    0.000

Summary statistics:
- Number of observations: 714
- R-squared: 0.986, Adjusted R-squared: 0.986
- F-statistic: 49,110.478 on 1 and 712 DF, p-value: 0.000



In [112]:
model_hml = (smf.ols(
    formula="hml ~ hml_replicated", 
    data=test
  )
  .fit()
)
prettify_result(model_hml)

OLS Model:
hml ~ hml_replicated

Coefficients:
                Estimate  Std. Error  t-Statistic  p-Value
Intercept          0.000        0.00        1.591    0.112
hml_replicated     0.992        0.01       96.544    0.000

Summary statistics:
- Number of observations: 714
- R-squared: 0.929, Adjusted R-squared: 0.929
- F-statistic: 9,320.798 on 1 and 712 DF, p-value: 0.000



In [113]:
model_rmw = (smf.ols(
    formula="rmw ~ rmw_replicated", 
    data=test
  )
  .fit()
)
prettify_result(model_rmw)

OLS Model:
rmw ~ rmw_replicated

Coefficients:
                Estimate  Std. Error  t-Statistic  p-Value
Intercept          0.000       0.000        0.277    0.782
rmw_replicated     0.955       0.009      107.519    0.000

Summary statistics:
- Number of observations: 714
- R-squared: 0.942, Adjusted R-squared: 0.942
- F-statistic: 11,560.251 on 1 and 712 DF, p-value: 0.000



In [114]:
model_cma = (smf.ols(
    formula="cma ~ cma_replicated", 
    data=test
  )
  .fit()
)
prettify_result(model_cma)

OLS Model:
cma ~ cma_replicated

Coefficients:
                Estimate  Std. Error  t-Statistic  p-Value
Intercept          0.001       0.000        3.928      0.0
cma_replicated     0.965       0.008      117.780      0.0

Summary statistics:
- Number of observations: 714
- R-squared: 0.951, Adjusted R-squared: 0.951
- F-statistic: 13,872.118 on 1 and 712 DF, p-value: 0.000



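Exercises:
1. Fama and French (1993) state that their sample only keeps firms that have appeared in Compustat for at least two years. Implement this additional filter and check whether it improves the replication.

- Data preprocessing  
Comparing the replication results

(1) Additional filter: process the raw Compustat data and drop the firms that appear in only one year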
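
The logic of the filter can be tried on a toy table first. The sketch below uses `groupby(...).transform("nunique")` to count distinct fiscal years per firm, which is an alternative formulation to the count-and-merge approach in the next cell; `compustat_toy` and its records are hypothetical.

```python
import pandas as pd

# Toy data: firm "A" has two fiscal years, firm "B" only one (hypothetical records)
compustat_toy = pd.DataFrame({
    "gvkey": ["A", "A", "B"],
    "datadate": pd.to_datetime(["2000-12-31", "2001-12-31", "2000-12-31"]),
})

# Number of distinct fiscal years per firm, broadcast back to every row
n_years = (compustat_toy
    .assign(year=lambda x: x["datadate"].dt.year)
    .groupby("gvkey")["year"]
    .transform("nunique")
)

# Keep only firms observed in at least two different years
compustat_toy_filter = compustat_toy[n_years >= 2]
print(compustat_toy_filter)
```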

In [115]:
compustat_filter = (compustat
    .assign(year = lambda x:x["datadate"].dt.year) #add a year column that holds only the year of datadate
    .drop_duplicates(subset=["gvkey","year"])      #keep each gvkey at most once per year
    .groupby("gvkey")                              #group by gvkey
    .apply(lambda x: x["year"].count())            #count the number of years for each firm
    .reset_index(name="num")                       #reset the index and name the count column num
    .query("num >= 2")                             #keep the gvkeys with records in at least two years
    .drop(columns=["num"])                         #drop the num column
    .merge(compustat,how="inner",on=["gvkey"])     #merge the remaining gvkeys back onto the original compustat table
)
compustat_filter


index,gvkey,datadate,be,op,inv
0,001000,1961-12-31,,,
1,001000,1962-12-31,0.552,2.880435,
2,001000,1963-12-31,0.561,0.046346,
3,001000,1964-12-31,0.627,0.149920,
4,001000,1965-12-31,0.491,-0.452138,0.631356
...,...,...,...,...,...
548097,351590,2022-12-31,21499.887,0.255406,0.096925
548098,353444,2021-12-31,40233.020,0.078487,
548099,353444,2022-12-31,24016.067,0.122968,-0.100782
548100,353945,2021-12-31,305.149,0.672177,


- The replication procedure itself is unchanged

(2) Replicating the three-factor model

In [116]:
book_to_market =(compustat_filter
    .assign(sorting_date = lambda x: pd.to_datetime((x["datadate"].dt.year+1).astype(str) + "-" +  "07-01"))
    .get(["gvkey","sorting_date","be"])
    .merge(market_equity,how="inner",on=["gvkey","sorting_date"])
    .assign(bm= lambda x:x["be"]/x["me"])
    .get(["permno","sorting_date","me","bm"])
)

sorting_variables = (size
    .merge(book_to_market,how="inner",on=["permno","sorting_date"])
    .dropna()
    .drop_duplicates(subset=["permno","sorting_date"])
)

portfolios = (sorting_variables
    .groupby(["sorting_date"],group_keys=False)
    .apply(lambda x: x.assign(
        portfolio_size = assign_portfolio(data=x,sorting_variable="size",percentiles=[0,0.5,1]),
        portfolio_bm = assign_portfolio(data=x,sorting_variable="bm",percentiles=[0,0.3,0.7,1])       
        )
    )
    .get(["permno","sorting_date","portfolio_size","portfolio_bm"])
)

portfolios=(crsp_monthly
    .assign(sorting_date = lambda x: np.where(x["date"].dt.month <= 6,pd.to_datetime((x["date"].dt.year-1).astype(str) + "-" +  "07-01"),
                                              pd.to_datetime((x["date"].dt.year).astype(str) + "-" +  "07-01")))
    .merge(portfolios,how="inner",on=["permno","sorting_date"])
)

factors_replicated =(portfolios
    .groupby(["portfolio_size","portfolio_bm","date"])
    .apply(lambda x:np.average(x["ret_excess"],weights=x["mktcap_lag"]))
    .reset_index(name="ret")
    .groupby("date",group_keys=False)
    .apply(lambda x:pd.Series(
        {"smb_replicated" : ((x["ret"][x["portfolio_size"] == 1]).mean() - (x["ret"][x["portfolio_size"] == 2]).mean()),
        "hml_replicated" : ((x["ret"][x["portfolio_bm"] == 3]).mean() - (x["ret"][x["portfolio_bm"] == 1]).mean())}))
)

test =(factors_ff3_monthly
    .merge(factors_replicated,how="inner",on=["date"])
    .assign(smb_replicated =lambda x: x["smb_replicated"].round(4),
            hml_replicated =lambda x: x["hml_replicated"].round(4))
)


Evaluating the replication

In [117]:
model_smb = (smf.ols(
    formula="smb ~ smb_replicated", 
    data=test
  )
  .fit()
)
prettify_result(model_smb)

model_hml = (smf.ols(
    formula="hml ~ hml_replicated", 
    data=test
  )
  .fit()
)
prettify_result(model_hml)

OLS Model:
smb ~ smb_replicated

Coefficients:
                Estimate  Std. Error  t-Statistic  p-Value
Intercept         -0.000       0.000       -1.122    0.262
smb_replicated     0.993       0.004      228.622    0.000

Summary statistics:
- Number of observations: 738
- R-squared: 0.986, Adjusted R-squared: 0.986
- F-statistic: 52,268.006 on 1 and 736 DF, p-value: 0.000

OLS Model:
hml ~ hml_replicated

Coefficients:
                Estimate  Std. Error  t-Statistic  p-Value
Intercept          0.000       0.000        1.281      0.2
hml_replicated     0.962       0.007      132.875      0.0

Summary statistics:
- Number of observations: 738
- R-squared: 0.960, Adjusted R-squared: 0.960
- F-statistic: 17,655.895 on 1 and 736 DF, p-value: 0.000



(3) Replicating the five-factor model

In [118]:
other_sorting_variables = (compustat_filter
    .assign(sorting_date = lambda x: pd.to_datetime((x["datadate"].dt.year+1).astype(str) + "-" +  "07-01"))
    .get(["gvkey","sorting_date","be","op","inv"])
    .merge(market_equity,how="inner",on=["gvkey", "sorting_date"])
    .assign(bm = lambda x: x["be"]/x["me"])
    .get(["permno","sorting_date","me","be","bm","op","inv"])
)

sorting_variables = (size
    .merge(other_sorting_variables,how="inner",on=["permno", "sorting_date"])
    .dropna()
    .drop_duplicates(subset=["permno","sorting_date"])
)

portfolios = (sorting_variables
    .groupby("sorting_date",group_keys=False)
    .apply(lambda x: x.assign(portfolio_size =  assign_portfolio(data=x,sorting_variable="size",percentiles=[0,0.5,1])))
    .groupby(["sorting_date","portfolio_size"],group_keys=False)
    .apply(lambda x: x.assign(**{f"portfolio_{col}" : assign_portfolio(data=x,sorting_variable=col,percentiles=[0,0.3,0.7,1]) for col in ["bm","op","inv"]}))
    .get(["permno","sorting_date","portfolio_size","portfolio_bm","portfolio_op","portfolio_inv"])
)

portfolios = (crsp_monthly
    .assign(sorting_date = lambda x: np.where(x["date"].dt.month <= 6,pd.to_datetime((x["date"].dt.year-1).astype(str) + "-07-01"),
                                              pd.to_datetime((x["date"].dt.year).astype(str) + "-07-01")))
    .merge(portfolios,how="inner",on=["permno","sorting_date"])
)

portfolios_value = (portfolios
    .groupby(["portfolio_size","portfolio_bm","date"])
    .apply(lambda x: np.average(x["ret_excess"],weights = x["mktcap_lag"]))
    .reset_index(name ="ret")
)

factors_value = (portfolios_value
    .groupby("date",group_keys = False)
    .apply(lambda x: pd.Series({"hml_replicated" : np.nanmean(x[x["portfolio_bm"] == 3]["ret"]) - np.nanmean(x[x["portfolio_bm"] == 1]["ret"])}))
)

portfolios_profitability = (portfolios
    .groupby(["portfolio_size","portfolio_op","date"])
    .apply(lambda x: np.average(x["ret_excess"],weights = x["mktcap_lag"]))
    .reset_index(name ="ret")
)

factors_profitability = (portfolios_profitability
    .groupby("date",group_keys = False)
    .apply(lambda x: pd.Series({"rmw_replicated" : np.nanmean(x[x["portfolio_op"] == 3]["ret"]) - np.nanmean(x[x["portfolio_op"] == 1]["ret"])}))
)

portfolios_investment = (portfolios
    .groupby(["portfolio_size","portfolio_inv","date"])
    .apply(lambda x: np.average(x["ret_excess"],weights = x["mktcap_lag"]))
    .reset_index(name ="ret")
)

factors_investment = (portfolios_investment
    .groupby("date",group_keys = False)
    .apply(lambda x: pd.Series({"cma_replicated" : np.nanmean(x[x["portfolio_inv"] == 1]["ret"]) - np.nanmean(x[x["portfolio_inv"] == 3]["ret"])}))
)

factors_size = (pd.concat([portfolios_value, portfolios_profitability, portfolios_investment], ignore_index=True)
                .groupby("date",group_keys=False)
                .apply(lambda x: pd.Series({"smb_replicated": np.nanmean(x[x["portfolio_size"] == 1]["ret"]) - np.nanmean(x[x["portfolio_size"] == 2]["ret"])}))
)

factors_replicated = (factors_size
    .merge(factors_value,how="outer",on=["date"])
    .merge(factors_profitability,how="outer",on=["date"])
    .merge(factors_investment,how="outer",on=["date"])
)

test = (factors_ff5_monthly
    .merge(factors_replicated,how="inner",on=["date"])
    .assign(smb_replicated =lambda x: x["smb_replicated"].round(4),
            hml_replicated =lambda x: x["hml_replicated"].round(4),
            rmw_replicated =lambda x: x["rmw_replicated"].round(4),
            cma_replicated =lambda x: x["cma_replicated"].round(4))
)


Evaluating the replication

In [119]:
model_smb = (smf.ols(
    formula="smb ~ smb_replicated", 
    data=test
  )
  .fit()
)
prettify_result(model_smb)

model_hml = (smf.ols(
    formula="hml ~ hml_replicated", 
    data=test
  )
  .fit()
)
prettify_result(model_hml)

model_rmw = (smf.ols(
    formula="rmw ~ rmw_replicated", 
    data=test
  )
  .fit()
)
prettify_result(model_rmw)

model_cma = (smf.ols(
    formula="cma ~ cma_replicated", 
    data=test
  )
  .fit()
)
prettify_result(model_cma)

OLS Model:
smb ~ smb_replicated

Coefficients:
                Estimate  Std. Error  t-Statistic  p-Value
Intercept          -0.00       0.000       -1.495    0.135
smb_replicated      0.97       0.004      221.609    0.000

Summary statistics:
- Number of observations: 714
- R-squared: 0.986, Adjusted R-squared: 0.986
- F-statistic: 49,110.478 on 1 and 712 DF, p-value: 0.000

OLS Model:
hml ~ hml_replicated

Coefficients:
                Estimate  Std. Error  t-Statistic  p-Value
Intercept          0.000        0.00        1.591    0.112
hml_replicated     0.992        0.01       96.544    0.000

Summary statistics:
- Number of observations: 714
- R-squared: 0.929, Adjusted R-squared: 0.929
- F-statistic: 9,320.798 on 1 and 712 DF, p-value: 0.000

OLS Model:
rmw ~ rmw_replicated

Coefficients:
                Estimate  Std. Error  t-Statistic  p-Value
Intercept          0.000       0.000        0.277    0.782
rmw_replicated     0.955       0.009      107.519    0.000

Summary statisti
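
The four regressions above follow one pattern, so they can be collected in a loop and compared side by side. The sketch below is a minimal illustration, not the notebook's own code: `replication_summary` is a hypothetical helper, and `demo` is synthetic data standing in for the `test` frame.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf


def replication_summary(data, factors):
    """Regress each original factor on its `<name>_replicated` column.

    Hypothetical helper: returns one row per factor with the intercept,
    slope and R-squared of the OLS fit.
    """
    rows = []
    for name in factors:
        fit = smf.ols(formula=f"{name} ~ {name}_replicated", data=data).fit()
        rows.append({
            "factor": name,
            "intercept": fit.params["Intercept"],
            "slope": fit.params[f"{name}_replicated"],
            "r_squared": fit.rsquared,
        })
    return pd.DataFrame(rows).set_index("factor")


# Synthetic stand-in for `test` (assumption: the real frame comes from the merge above).
rng = np.random.default_rng(0)
demo = pd.DataFrame({"smb_replicated": rng.normal(0, 0.03, 200),
                     "hml_replicated": rng.normal(0, 0.03, 200)})
demo["smb"] = demo["smb_replicated"] + rng.normal(0, 0.002, 200)
demo["hml"] = demo["hml_replicated"] + rng.normal(0, 0.002, 200)

summary = replication_summary(demo, ["smb", "hml"])
print(summary.round(3))
```

With the real data, the call would be `replication_summary(test, ["smb", "hml", "rmw", "cma"])`. A good replication shows an intercept near 0, a slope near 1 and a high R-squared.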

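2. On his homepage, Kenneth French describes how to construct the most common variables used for portfolio sorts. Try to replicate the return time series of the univariate portfolio sort on E/P (earnings/price) provided on his homepage, and evaluate your replication using regressions.  

- This is similar to replicating a new factor.  
Difference: after sorting stocks into portfolios on E/P, no long-short position across portfolios is formed, so no additional factor portfolio is constructed.

- E/P (earnings/price):  
Universe: all US common stocks in the CRSP database  
Portfolio formation date: June of each year (denoted t)  
Sorting variable: earnings-price ratio = earnings for the fiscal year ending in t-1 / market equity at the end of December of t-1  
Breakpoints: NYSE earnings-price ratio < 0; 30th and 70th percentiles; quintiles; deciles  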
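
The breakpoint rule above (percentiles taken from NYSE stocks, then applied to all stocks) can be sketched as follows. This is an illustration under stated assumptions: `assign_portfolio_nyse` is a hypothetical helper, the `exchange` column is assumed to label NYSE stocks as `"NYSE"`, and whether the notebook's own `assign_portfolio` already restricts breakpoints to NYSE is not visible in this section.

```python
import numpy as np
import pandas as pd


def assign_portfolio_nyse(data, sorting_variable, percentiles):
    """Assign 1-based portfolio numbers using breakpoints from NYSE stocks only.

    Hypothetical helper; assumes no missing values in the sorting variable.
    """
    nyse = data.loc[data["exchange"] == "NYSE", sorting_variable]
    breakpoints = np.quantile(nyse, percentiles)
    # Open the outer edges so non-NYSE stocks outside the NYSE range still get a bin.
    breakpoints[0], breakpoints[-1] = -np.inf, np.inf
    labels = list(range(1, len(percentiles)))
    return pd.cut(data[sorting_variable], bins=breakpoints,
                  labels=labels, include_lowest=True).astype(int)


demo = pd.DataFrame({
    "exchange": ["NYSE"] * 5 + ["NASDAQ"] * 3,
    "ep": [0.02, 0.04, 0.06, 0.08, 0.10, -0.05, 0.05, 0.30],
})
demo["portfolio_ep"] = assign_portfolio_nyse(demo, "ep", [0, 0.3, 0.7, 1])
print(demo)
```

The NASDAQ stocks with `ep` of -0.05 and 0.30 fall outside the NYSE range, which is why the outer bin edges are opened to infinity.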

(1) Reading the data

In [120]:
from datetime import datetime   # datetime module: provides date and time handling

ep_portfolios_return = pd.read_csv(r'C:\Users\86189\Desktop\For Students\ep_portfolios.csv')
ep_portfolios_return = (ep_portfolios_return
    .assign(date = lambda x: x["date"].astype(str).apply(lambda y: datetime.strptime(y,"%Y%m")))
)
ep_portfolios_return

Unnamed: 0,date,<= 0,Lo 30,Med 40,Hi 30,Lo 20,Qnt 2,Qnt 3,Qnt 4,Hi 20,Lo 10,2-Dec,3-Dec,4-Dec,5-Dec,6-Dec,7-Dec,8-Dec,9-Dec,Hi 10
0,1951-07-01,3.01,6.39,8.56,5.97,5.77,7.99,9.58,4.94,9.12,5.21,7.34,8.96,7.44,9.41,9.67,7.09,4.50,9.84,8.16
1,1951-08-01,14.65,3.88,5.72,6.65,2.92,6.56,5.67,6.63,6.02,2.55,3.95,7.73,5.90,4.83,6.09,5.05,6.95,5.74,6.41
2,1951-09-01,-2.90,-0.23,-0.15,1.65,-0.50,1.36,-1.78,1.76,1.65,-1.22,1.47,0.80,1.67,-0.01,-2.65,2.26,1.66,2.18,0.93
3,1951-10-01,-3.62,-4.12,-1.12,-2.26,-4.56,-3.22,1.12,-2.59,-1.99,-4.93,-3.60,-2.43,-3.66,-1.66,2.52,-3.56,-2.39,-3.00,-0.59
4,1951-11-01,8.33,0.21,0.69,1.30,0.15,0.18,0.98,2.07,-0.30,-0.21,1.10,0.40,0.06,1.29,0.83,2.01,2.08,0.06,-0.80
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
875,2024-06-01,3.86,4.21,2.99,-2.05,5.12,3.87,1.09,0.83,-2.31,3.28,6.16,2.99,5.02,2.00,-1.20,2.39,-1.83,-1.37,-3.66
876,2024-07-01,2.28,-0.17,2.95,5.54,-1.93,4.10,2.15,5.71,5.56,-3.28,-0.85,4.71,3.27,0.36,6.79,5.89,5.51,5.74,4.71
877,2024-08-01,0.64,2.34,2.09,1.71,2.16,3.57,1.28,0.09,2.72,4.04,0.70,2.83,4.58,0.97,2.02,0.26,-0.09,3.72,-1.95
878,2024-09-01,1.11,2.77,2.33,-0.68,2.75,2.01,3.21,1.60,-1.86,0.98,4.18,2.82,0.91,3.77,1.85,1.68,1.52,-1.92,-1.54
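
As an alternative to the row-by-row `datetime.strptime` call, the `YYYYMM` integers can be parsed in one vectorized step with `pd.to_datetime`. A minimal sketch on a three-row frame that mimics the CSV layout (the values are copied from the output above, and the frame name `raw` is only for illustration):

```python
import pandas as pd

# Small stand-in for the CSV: dates stored as YYYYMM integers.
raw = pd.DataFrame({"date": [195107, 195108, 202409],
                    "Lo 30": [6.39, 3.88, 2.77]})

# Parse the whole column at once; each month maps to its first day.
parsed = raw.assign(
    date=lambda x: pd.to_datetime(x["date"].astype(str), format="%Y%m")
)
print(parsed["date"].dt.strftime("%Y-%m-%d").tolist())
# ['1951-07-01', '1951-08-01', '2024-09-01']
```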


In [121]:
compustat = (pd.read_sql_query(
    sql="SELECT * FROM compustat",   # select all columns from the compustat table
    con=tidy_finance,
    parse_dates={"datadate"})
)

(2) Merging the datasets

In [122]:
earnings_prices = (compustat
    .assign(
        # Earnings = sales minus cost of goods sold, SG&A expenses and interest expense;
        # missing cost items are replaced with 0 via .fillna(0)
        earn = lambda x: x["sale"] - x["cogs"].fillna(0) - x["xsga"].fillna(0) - x["xint"].fillna(0),
        # Align every fiscal year ending in t-1 to the common date 1 July of year t
        sorting_date = lambda x: pd.to_datetime((x["datadate"].dt.year+1).astype(str) + "-" +  "07-01"))
    .get(["gvkey","sorting_date","earn"])
    .merge(market_equity,how="inner",on=["gvkey","sorting_date"])
    .assign(ep = lambda x: x["earn"]/x["me"])  # earnings/price ratio
    .get(["permno","sorting_date","me","ep"])
)

sorting_variables = (size
    .merge(earnings_prices,how="inner",on=["permno","sorting_date"])
    .dropna()
    .drop_duplicates(subset=["permno","sorting_date"])
)
sorting_variables

Unnamed: 0,permno,exchange,sorting_date,size,me,ep
0,10028,AMEX,1993-07-01,9.044750,7.735750,-0.013186
1,10028,AMEX,1994-07-01,13.209750,13.567125,0.021523
2,10028,AMEX,1995-07-01,9.192187,13.126500,0.012875
3,10028,AMEX,1996-07-01,8.367688,7.287500,0.003705
4,10028,AMEX,1997-07-01,7.735000,6.158250,-0.024358
...,...,...,...,...,...,...
242271,10042,AMEX,2001-07-01,14.687200,11.624000,0.729869
242272,10042,AMEX,2002-07-01,39.130002,14.834060,0.190440
242273,10042,AMEX,2003-07-01,28.366201,22.565760,-0.013029
242274,10042,AMEX,2004-07-01,108.659373,96.315269,-0.072065


(3) Factor replication

- Sorting criterion: ep < 0

In [123]:
portfolios_minus = (crsp_monthly
    .assign(sorting_date = lambda x: np.where(x["date"].dt.month <= 6,pd.to_datetime((x["date"].dt.year-1).astype(str) + "-" +  "07-01"),
                                              pd.to_datetime((x["date"].dt.year).astype(str) + "-" +  "07-01")))
    .merge(sorting_variables,how="inner",on=["permno","sorting_date"])
    .query("ep < 0")  # keep only rows with negative ep
    .groupby("date")
    .apply(lambda x:(np.average(x["ret_excess"],weights=x["mktcap_lag"]))*100)
    .reset_index(name="ret")  
)
portfolios_minus


Unnamed: 0,date,ret
0,1961-07-01,5.485720
1,1961-08-01,-5.723373
2,1961-09-01,-9.501958
3,1961-10-01,1.492499
4,1961-11-01,0.668542
...,...,...
733,2022-08-01,2.137687
734,2022-09-01,-10.537406
735,2022-10-01,3.353137
736,2022-11-01,-1.542637
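
The `np.where` expression in the cell above maps every monthly return to the 1 July that opens its holding period: January to June belong to the previous year's sort, July to December to the current year's. A small sketch of the same rule, with `july_sorting_date` as a hypothetical helper name:

```python
import numpy as np
import pandas as pd


def july_sorting_date(dates):
    """Map each month to the 1 July that starts its holding period."""
    dates = pd.to_datetime(pd.Series(dates))
    # Months up to June still belong to the portfolio formed in the previous July.
    year = np.where(dates.dt.month <= 6, dates.dt.year - 1, dates.dt.year)
    return pd.to_datetime(pd.Series(year, index=dates.index).astype(str) + "-07-01")


demo = pd.Series(pd.to_datetime(["1993-06-01", "1993-07-01", "1994-01-01"]))
result = july_sorting_date(demo)
print(result.dt.strftime("%Y-%m-%d").tolist())
# ['1992-07-01', '1993-07-01', '1993-07-01']
```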


In [124]:
test_portfolios_minus = (ep_portfolios_return
    .get(["date","<= 0"])
    .merge(portfolios_minus,how="inner",on=["date"])
    .rename(columns={"<= 0" : "minus","ret" : "minus_replicated"})
)
test_portfolios_minus

Unnamed: 0,date,minus,minus_replicated
0,1961-07-01,4.37,5.485720
1,1961-08-01,-4.42,-5.723373
2,1961-09-01,-8.89,-9.501958
3,1961-10-01,1.06,1.492499
4,1961-11-01,1.94,0.668542
...,...,...,...
733,2022-08-01,1.76,2.137687
734,2022-09-01,-9.85,-10.537406
735,2022-10-01,6.20,3.353137
736,2022-11-01,1.42,-1.542637


Evaluating the replication

In [125]:
model_portfolios_minus = (smf.ols(
    formula="minus ~ minus_replicated", 
    data=test_portfolios_minus
  )
  .fit()
)
prettify_result(model_portfolios_minus)

OLS Model:
minus ~ minus_replicated

Coefficients:
                  Estimate  Std. Error  t-Statistic  p-Value
Intercept            0.812       0.104        7.826      0.0
minus_replicated     0.841       0.013       65.325      0.0

Summary statistics:
- Number of observations: 738
- R-squared: 0.853, Adjusted R-squared: 0.853
- F-statistic: 4,267.317 on 1 and 736 DF, p-value: 0.000
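
The intercept is positive and statistically significant, so the replicated series is shifted relative to the published one. One candidate explanation, offered here as an assumption rather than a finding of this notebook: the replication value-weights `ret_excess` (returns net of the risk-free rate), whereas Kenneth French's portfolio files report raw returns. In that case the intercept would absorb roughly the average monthly risk-free rate. The sketch below shows the mechanism on simulated data only; every number in it is made up for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 600
rf = np.full(n, 0.4)                      # assumed constant risk-free rate, % per month
excess = rng.normal(0.5, 5.0, n)          # simulated excess returns, % per month

sim = pd.DataFrame({
    "published": excess + rf + rng.normal(0, 0.3, n),  # raw return plus small noise
    "replicated_excess": excess,                       # what the notebook computes
    "replicated_raw": excess + rf,                     # risk-free rate added back
})

fit_excess = smf.ols("published ~ replicated_excess", data=sim).fit()
fit_raw = smf.ols("published ~ replicated_raw", data=sim).fit()

print("intercept, excess-return regressor:", round(fit_excess.params["Intercept"], 2))
print("intercept, raw-return regressor:   ", round(fit_raw.params["Intercept"], 2))
```

To check this on the real data, the risk-free rate would have to be added back to `ret_excess` before value weighting, assuming a risk-free series is available from the factor tables. The slope below 1 suggests that other differences, such as the earnings definition or the sample, also play a role.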



- Portfolios with breakpoints at the 30th and 70th percentiles  
- Portfolios with quintile breakpoints  
- Portfolios with decile breakpoints  

In [126]:
portfolios = (sorting_variables
    .groupby(["sorting_date"],group_keys=False)
    .apply(lambda x: x.assign(portfolio_ep_three = assign_portfolio(data=x,sorting_variable="ep",percentiles=[0,0.3,0.7,1]),
                              portfolio_ep_five = assign_portfolio(data=x,sorting_variable="ep",percentiles=[0,0.2,0.4,0.6,0.8,1]),
                              portfolio_ep_ten = assign_portfolio(data=x,sorting_variable="ep",percentiles=[0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1])))
    .get(["permno","sorting_date","portfolio_ep_three","portfolio_ep_five","portfolio_ep_ten"])
)
portfolios


Unnamed: 0,permno,sorting_date,portfolio_ep_three,portfolio_ep_five,portfolio_ep_ten
0,10028,1993-07-01,1,1,1
1,10028,1994-07-01,1,1,1
2,10028,1995-07-01,1,1,1
3,10028,1996-07-01,1,1,1
4,10028,1997-07-01,1,1,1
...,...,...,...,...,...
242271,10042,2001-07-01,3,5,10
242272,10042,2002-07-01,3,4,8
242273,10042,2003-07-01,1,1,1
242274,10042,2004-07-01,1,1,1


In [127]:
portfolios=(crsp_monthly
    .assign(sorting_date = lambda x: np.where(x["date"].dt.month <= 6,pd.to_datetime((x["date"].dt.year-1).astype(str) + "-" +  "07-01"),
                                              pd.to_datetime((x["date"].dt.year).astype(str) + "-" +  "07-01")))
    .merge(portfolios,how="inner",on=["permno","sorting_date"])
)
portfolios

Unnamed: 0,permno,gvkey,date,ret_excess,mktcap,mktcap_lag,exchange,sorting_date,portfolio_ep_three,portfolio_ep_five,portfolio_ep_ten
0,10028,012096,1993-03-01,-0.102500,6.329250,7.032500,AMEX,1992-07-01,1,1,2
1,10028,012096,1993-04-01,0.386489,8.790625,6.329250,AMEX,1992-07-01,1,1,2
2,10028,012096,1993-05-01,0.197800,10.548750,8.790625,AMEX,1992-07-01,1,1,2
3,10028,012096,1993-06-01,-0.135833,9.044750,10.548750,AMEX,1992-07-01,1,1,2
4,10028,012096,1993-07-01,0.189908,10.784125,9.044750,AMEX,1993-07-01,1,1,1
...,...,...,...,...,...,...,...,...,...,...,...
2707144,10042,012139,2005-02-01,-0.215192,23.583960,29.989479,AMEX,2004-07-01,1,1,1
2707145,10042,012139,2005-03-01,-0.113211,21.029761,23.583960,AMEX,2004-07-01,1,1,1
2707146,10042,012139,2005-04-01,-0.071544,19.569360,21.029761,AMEX,2004-07-01,1,1,1
2707147,10042,012139,2005-05-01,0.102078,25.538140,19.569360,AMEX,2004-07-01,1,1,1


- Breakpoints at the 30th and 70th percentiles

In [128]:
portfolio_ep_three =(portfolios
    .groupby(["portfolio_ep_three","date"])
    .apply(lambda x:(np.average(x["ret_excess"],weights=x["mktcap_lag"]))*100)
    .reset_index(name="ret")
    .pivot_table(index='date', columns='portfolio_ep_three', values='ret')
    # Pivot to wide format: date becomes the row index, the values of portfolio_ep_three become the columns, and ret fills the cells
    .reset_index()
)
portfolio_ep_three


portfolio_ep_three,date,1,2,3
0,1961-07-01,1.820810,3.891827,4.427611
1,1961-08-01,3.585676,1.400427,1.337895
2,1961-09-01,-2.107969,-2.676887,-0.142716
3,1961-10-01,2.498183,3.150545,2.039614
4,1961-11-01,3.321676,4.944449,5.977436
...,...,...,...,...
733,2022-08-01,-4.986330,-3.404854,-2.392822
734,2022-09-01,-9.901886,-8.864238,-9.437047
735,2022-10-01,2.631081,8.931982,14.667344
736,2022-11-01,3.359694,5.099187,5.398848
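
The value-weighted return per portfolio and month can also be computed without `groupby(...).apply`, by summing weight times return and dividing by the summed weights. The sketch below is one possible formulation, not the notebook's code: `value_weighted_returns` is a hypothetical helper, and it assumes there are no missing returns or weights (with missing values it would not match `np.average`).

```python
import pandas as pd


def value_weighted_returns(data, group_cols, ret_col="ret_excess", weight_col="mktcap_lag"):
    """Value-weighted mean of `ret_col` per group, in percent, one column per portfolio.

    Hypothetical helper; assumes group_cols = [portfolio label, date].
    """
    tmp = data.assign(_wr=data[ret_col] * data[weight_col])
    sums = tmp.groupby(group_cols)[["_wr", weight_col]].sum()
    ret = (sums["_wr"] / sums[weight_col] * 100).rename("ret").reset_index()
    return ret.pivot(index=group_cols[1], columns=group_cols[0], values="ret").reset_index()


demo = pd.DataFrame({
    "date": pd.to_datetime(["2020-01-01"] * 4),
    "portfolio_ep_three": [1, 1, 3, 3],
    "ret_excess": [0.01, 0.03, -0.02, 0.02],
    "mktcap_lag": [100.0, 300.0, 50.0, 150.0],
})
wide = value_weighted_returns(demo, ["portfolio_ep_three", "date"])
print(wide)
```

In the demo, portfolio 1 returns (0.01·100 + 0.03·300)/400 = 2.5% and portfolio 3 returns (−0.02·50 + 0.02·150)/200 = 1.0%.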


In [129]:
test_portfolios_three = (ep_portfolios_return
    .get(["date","Lo 30","Med 40","Hi 30"])
    .rename(columns=lambda x: x.replace(' ', '_'))
    .merge(portfolio_ep_three,how="inner",on=["date"])
    .rename(columns=lambda x: 'portfolio_' + str(x) if str(x).isdigit() else x)
)
test_portfolios_three

Unnamed: 0,date,Lo_30,Med_40,Hi_30,portfolio_1,portfolio_2,portfolio_3
0,1961-07-01,2.73,3.69,3.43,1.820810,3.891827,4.427611
1,1961-08-01,2.79,2.96,1.06,3.585676,1.400427,1.337895
2,1961-09-01,-1.30,-2.75,-0.71,-2.107969,-2.676887,-0.142716
3,1961-10-01,2.25,3.68,2.94,2.498183,3.150545,2.039614
4,1961-11-01,4.24,4.36,6.83,3.321676,4.944449,5.977436
...,...,...,...,...,...,...,...
733,2022-08-01,-4.98,-3.29,-2.40,-4.986330,-3.404854,-2.392822
734,2022-09-01,-9.61,-8.39,-8.91,-9.901886,-8.864238,-9.437047
735,2022-10-01,5.21,9.40,14.46,2.631081,8.931982,14.667344
736,2022-11-01,4.01,6.61,5.74,3.359694,5.099187,5.398848


Evaluating the replication

In [130]:
model_portfolios_three = (smf.ols(
    formula="Lo_30 ~ portfolio_1", 
    data=test_portfolios_three
  )
  .fit()
)
prettify_result(model_portfolios_three)

model_portfolios_three = (smf.ols(
    formula="Med_40 ~ portfolio_2", 
    data=test_portfolios_three
  )
  .fit()
)
prettify_result(model_portfolios_three)

model_portfolios_three = (smf.ols(
    formula="Hi_30 ~ portfolio_3", 
    data=test_portfolios_three
  )
  .fit()
)
prettify_result(model_portfolios_three)

OLS Model:
Lo_30 ~ portfolio_1

Coefficients:
             Estimate  Std. Error  t-Statistic  p-Value
Intercept       0.415       0.027       15.265      0.0
portfolio_1     0.950       0.005      176.116      0.0

Summary statistics:
- Number of observations: 738
- R-squared: 0.977, Adjusted R-squared: 0.977
- F-statistic: 31,016.734 on 1 and 736 DF, p-value: 0.000

OLS Model:
Med_40 ~ portfolio_2

Coefficients:
             Estimate  Std. Error  t-Statistic  p-Value
Intercept       0.387       0.029       13.156      0.0
portfolio_2     0.966       0.007      143.205      0.0

Summary statistics:
- Number of observations: 738
- R-squared: 0.965, Adjusted R-squared: 0.965
- F-statistic: 20,507.674 on 1 and 736 DF, p-value: 0.000

OLS Model:
Hi_30 ~ portfolio_3

Coefficients:
             Estimate  Std. Error  t-Statistic  p-Value
Intercept       0.400       0.041        9.643      0.0
portfolio_3     0.975       0.009      112.636      0.0

Summary statistics:
- Number of observations

- Replication with quintile breakpoints  

In [131]:
portfolio_ep_five =(portfolios
    .groupby(["portfolio_ep_five","date"])
    .apply(lambda x:(np.average(x["ret_excess"],weights=x["mktcap_lag"]))*100)
    .reset_index(name="ret")
    .pivot_table(index='date', columns='portfolio_ep_five', values='ret')
    .reset_index()
)

test_portfolios_five = (ep_portfolios_return
    .get(["date","Lo 20","Qnt 2","Qnt 3","Qnt 4","Hi 20"])
    .rename(columns=lambda x: x.replace(' ', '_'))
    .merge(portfolio_ep_five,how="inner",on=["date"])
    .rename(columns=lambda x: 'portfolio_' + str(x) if str(x).isdigit() else x)
)
test_portfolios_five


Unnamed: 0,date,Lo_20,Qnt_2,Qnt_3,Qnt_4,Hi_20,portfolio_1,portfolio_2,portfolio_3,portfolio_4,portfolio_5
0,1961-07-01,1.51,4.83,3.21,3.70,3.35,1.609198,3.288223,4.412111,2.157260,4.895368
1,1961-08-01,3.69,1.80,2.94,2.00,0.96,3.510890,2.459766,1.598154,1.106798,1.282850
2,1961-09-01,-1.21,-1.79,-1.15,-1.94,-3.42,-1.803861,-4.389419,-0.933440,-2.564272,-0.051680
3,1961-10-01,2.35,2.21,4.79,2.51,4.81,2.570089,3.370393,2.140519,3.916031,1.869106
4,1961-11-01,3.00,5.75,4.10,6.59,4.84,2.879615,4.467820,5.931037,3.563841,6.436270
...,...,...,...,...,...,...,...,...,...,...,...
733,2022-08-01,-5.71,-3.18,-3.64,-2.07,-3.60,-3.349333,-4.958813,-2.642564,-2.955635,-1.760847
734,2022-09-01,-9.47,-9.08,-8.73,-8.51,-8.80,-9.587540,-9.762599,-8.110685,-9.135845,-9.532296
735,2022-10-01,3.54,9.45,7.18,15.40,13.24,0.981562,6.498670,8.370481,12.596605,13.954189
736,2022-11-01,4.48,4.52,7.05,5.50,6.39,-0.986926,4.823030,5.746362,5.760062,5.121205


Evaluating the replication

In [132]:
model_portfolios_five = (smf.ols(
    formula="Lo_20 ~ portfolio_1", 
    data=test_portfolios_five
  )
  .fit()
)
prettify_result(model_portfolios_five)

model_portfolios_five = (smf.ols(
    formula="Qnt_2 ~ portfolio_2", 
    data=test_portfolios_five
  )
  .fit()
)
prettify_result(model_portfolios_five)

model_portfolios_five = (smf.ols(
    formula="Qnt_3 ~ portfolio_3", 
    data=test_portfolios_five
  )
  .fit()
)
prettify_result(model_portfolios_five)

model_portfolios_five = (smf.ols(
    formula="Qnt_4 ~ portfolio_4", 
    data=test_portfolios_five
  )
  .fit()
)
prettify_result(model_portfolios_five)

model_portfolios_five = (smf.ols(
    formula="Hi_20 ~ portfolio_5", 
    data=test_portfolios_five
  )
  .fit()
)
prettify_result(model_portfolios_five)

OLS Model:
Lo_20 ~ portfolio_1

Coefficients:
             Estimate  Std. Error  t-Statistic  p-Value
Intercept       0.394       0.037       10.722      0.0
portfolio_1     0.931       0.007      136.599      0.0

Summary statistics:
- Number of observations: 738
- R-squared: 0.962, Adjusted R-squared: 0.962
- F-statistic: 18,659.356 on 1 and 736 DF, p-value: 0.000

OLS Model:
Qnt_2 ~ portfolio_2

Coefficients:
             Estimate  Std. Error  t-Statistic  p-Value
Intercept       0.461       0.040       11.469      0.0
portfolio_2     0.940       0.009      106.497      0.0

Summary statistics:
- Number of observations: 738
- R-squared: 0.939, Adjusted R-squared: 0.939
- F-statistic: 11,341.506 on 1 and 736 DF, p-value: 0.000

OLS Model:
Qnt_3 ~ portfolio_3

Coefficients:
             Estimate  Std. Error  t-Statistic  p-Value
Intercept       0.329       0.046        7.187      0.0
portfolio_3     0.970       0.011       91.537      0.0

Summary statistics:
- Number of observations:

- Replication with decile breakpoints  

In [133]:
portfolio_ep_ten =(portfolios
    .groupby(["portfolio_ep_ten","date"])
    .apply(lambda x:(np.average(x["ret_excess"],weights=x["mktcap_lag"]))*100)
    .reset_index(name="ret")
    .pivot_table(index='date', columns='portfolio_ep_ten', values='ret')
    .reset_index()
)

test_portfolios_ten = (ep_portfolios_return
    .get(["date","Lo 10","2-Dec","3-Dec","4-Dec","5-Dec","6-Dec","7-Dec","8-Dec","9-Dec","Hi 10"])
    .rename(columns=lambda x: x.replace(' ', '_'))
    .rename(columns=lambda x: f"Dec_{x.split('-')[0]}" if '-' in x else x)
    .merge(portfolio_ep_ten,how="inner",on=["date"])
    .rename(columns=lambda x: 'portfolio_' + str(x) if str(x).isdigit() else x)
)
test_portfolios_ten


Unnamed: 0,date,Lo_10,Dec_2,Dec_3,Dec_4,Dec_5,Dec_6,Dec_7,Dec_8,Dec_9,...,portfolio_1,portfolio_2,portfolio_3,portfolio_4,portfolio_5,portfolio_6,portfolio_7,portfolio_8,portfolio_9,portfolio_10
0,1961-07-01,0.83,2.17,5.11,4.17,3.82,2.48,3.86,3.49,3.15,...,1.955403,1.250844,2.829181,3.627898,4.370170,4.439036,2.431695,1.829122,4.885309,4.925980
1,1961-08-01,4.49,2.93,1.12,3.41,2.51,3.46,2.66,1.14,-0.14,...,3.871256,3.135230,3.937939,1.375085,2.192951,1.216619,0.653147,1.653071,0.789179,2.788284
2,1961-09-01,-0.60,-1.81,-1.46,-2.57,-2.71,0.70,-4.52,1.44,-4.75,...,-0.533698,-3.139717,-3.536219,-5.033193,-3.370391,0.650266,-4.167057,-0.661461,-0.248365,0.535407
3,1961-10-01,2.03,2.65,2.05,2.59,4.69,4.91,3.30,1.52,6.85,...,2.970055,2.138143,2.154351,4.302284,1.361716,2.626111,4.696182,3.019454,1.609519,2.636254
4,1961-11-01,2.12,3.86,6.65,3.64,3.78,4.48,5.16,8.38,4.05,...,2.058833,3.774660,5.443391,3.735653,4.196809,6.999821,3.730052,3.369553,7.432283,3.511403
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
733,2022-08-01,-6.30,-5.38,-3.58,-2.69,-4.57,-2.08,-3.48,-1.18,-4.39,...,4.041291,-4.621357,-5.744424,-4.209094,-2.359046,-2.930196,-2.833299,-3.051485,-1.115943,-2.828256
734,2022-09-01,-9.12,-9.66,-9.86,-8.12,-8.43,-9.21,-7.72,-9.01,-6.72,...,-8.243575,-9.837526,-10.051576,-9.490412,-8.306571,-7.911146,-8.882389,-9.335842,-8.663980,-10.974665
735,2022-10-01,4.21,3.18,8.39,10.72,3.38,13.52,14.95,15.68,13.19,...,4.692199,0.283747,3.423397,9.376092,6.019742,10.756553,9.040467,15.420957,13.313871,15.037373
736,2022-11-01,2.67,5.48,3.15,6.13,7.43,6.47,6.16,5.09,5.91,...,-2.356198,-0.715895,5.395938,4.315884,5.760307,5.732832,5.859490,5.685587,5.288097,4.836824


Evaluating the replication

In [134]:
model_portfolios_ten = (smf.ols(
    formula="Lo_10 ~ portfolio_1", 
    data=test_portfolios_ten
  )
  .fit()
)
prettify_result(model_portfolios_ten)

model_portfolios_ten = (smf.ols(
    formula="Dec_2 ~ portfolio_2", 
    data=test_portfolios_ten
  )
  .fit()
)
prettify_result(model_portfolios_ten)

model_portfolios_ten = (smf.ols(
    formula="Dec_3 ~ portfolio_3", 
    data=test_portfolios_ten
  )
  .fit()
)
prettify_result(model_portfolios_ten)

model_portfolios_ten = (smf.ols(
    formula="Dec_4 ~ portfolio_4", 
    data=test_portfolios_ten
  )
  .fit()
)
prettify_result(model_portfolios_ten)

model_portfolios_ten = (smf.ols(
    formula="Dec_5 ~ portfolio_5", 
    data=test_portfolios_ten
  )
  .fit()
)
prettify_result(model_portfolios_ten)

model_portfolios_ten = (smf.ols(
    formula="Dec_6 ~ portfolio_6", 
    data=test_portfolios_ten
  )
  .fit()
)
prettify_result(model_portfolios_ten)

model_portfolios_ten = (smf.ols(
    formula="Dec_7 ~ portfolio_7", 
    data=test_portfolios_ten
  )
  .fit()
)
prettify_result(model_portfolios_ten)

model_portfolios_ten = (smf.ols(
    formula="Dec_8 ~ portfolio_8", 
    data=test_portfolios_ten
  )
  .fit()
)
prettify_result(model_portfolios_ten)

model_portfolios_ten = (smf.ols(
    formula="Dec_9 ~ portfolio_9", 
    data=test_portfolios_ten
  )
  .fit()
)
prettify_result(model_portfolios_ten)

model_portfolios_ten = (smf.ols(
    formula="Hi_10 ~ portfolio_10", 
    data=test_portfolios_ten
  )
  .fit()
)
prettify_result(model_portfolios_ten)

OLS Model:
Lo_10 ~ portfolio_1

Coefficients:
             Estimate  Std. Error  t-Statistic  p-Value
Intercept       0.552       0.086        6.445      0.0
portfolio_1     0.820       0.013       61.071      0.0

Summary statistics:
- Number of observations: 738
- R-squared: 0.835, Adjusted R-squared: 0.835
- F-statistic: 3,729.693 on 1 and 736 DF, p-value: 0.000

OLS Model:
Dec_2 ~ portfolio_2

Coefficients:
             Estimate  Std. Error  t-Statistic  p-Value
Intercept       0.344       0.061        5.664      0.0
portfolio_2     0.887       0.012       74.189      0.0

Summary statistics:
- Number of observations: 738
- R-squared: 0.882, Adjusted R-squared: 0.882
- F-statistic: 5,503.979 on 1 and 736 DF, p-value: 0.000

OLS Model:
Dec_3 ~ portfolio_3

Coefficients:
             Estimate  Std. Error  t-Statistic  p-Value
Intercept       0.511       0.055        9.228      0.0
portfolio_3     0.932       0.012       78.559      0.0

Summary statistics:
- Number of observations: 7
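
A perfect replication implies an intercept of 0 and a slope of 1 at the same time, so the two coefficients can be tested jointly instead of being read off one table at a time. The sketch below uses the `f_test` method of a fitted statsmodels result; `joint_replication_test` is a hypothetical helper, and `demo` is simulated data standing in for `test_portfolios_ten`.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf


def joint_replication_test(data, original, replicated):
    """F-test of H0: intercept = 0 and slope = 1 for `original ~ replicated`."""
    fit = smf.ols(formula=f"{original} ~ {replicated}", data=data).fit()
    ftest = fit.f_test(f"Intercept = 0, {replicated} = 1")
    return {"pair": f"{original} ~ {replicated}",
            "r_squared": fit.rsquared,
            "p_value": float(ftest.pvalue)}


# Simulated stand-in: one unbiased copy and one shifted, rescaled copy.
rng = np.random.default_rng(2)
demo = pd.DataFrame({f"portfolio_{i}": rng.normal(1, 5, 300) for i in (1, 2)})
demo["Lo_10"] = demo["portfolio_1"] + rng.normal(0, 0.5, 300)
demo["Dec_2"] = 0.5 + 0.9 * demo["portfolio_2"] + rng.normal(0, 0.5, 300)

pairs = [("Lo_10", "portfolio_1"), ("Dec_2", "portfolio_2")]
results = pd.DataFrame([joint_replication_test(demo, o, r) for o, r in pairs])
print(results)
```

A small p-value rejects exact replication even when R-squared is high, as for the shifted `Dec_2` series in the demo. For the ten decile regressions, `pairs` would list all ten column pairs of `test_portfolios_ten`.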

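Summary:  
1. Import libraries  
2. Read the required data  
3. Merge the datasets (at the portfolio formation dates)  
4. Sort into portfolios (every firm in every period)
5. Compute the excess returns of the factor-mimicking portfolios
6. Merge the replicated results with the published factor excess-return table for comparison
7. Evaluate the replication: ordinary least squares regression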