# Quick Tour

This tour shows how to get around the Chen and Zimmermann (CZ) dataset using **openassetpricing**.

(Installation instructions: to be completed.)

In [2]:
# Set up environment
import pandas as pd
import numpy as np
import openassetpricing as oap
import statsmodels.formula.api as smf

# Initialize OpenAP using the 2024 data release 
openap = oap.OpenAP()

# Navigating Signal Doc

The CZ dataset is organized around "signals." Each signal is described in the signal doc, which can be downloaded using `openap.dl_signal_doc("pandas")`. 

In [4]:
# Download signal doc
signaldoc = openap.dl_signal_doc("pandas")

# show a few rows
signaldoc.head(3)

| | Acronym | Cat.Signal | Predictability in OP | Signal Rep Quality | Authors | Year | LongDescription | Journal | Cat.Form | Cat.Data | ... | Return | T-Stat | Stock Weight | LS Quantile | Quantile Filter | Portfolio Period | Start Month | Filter | Notes | Detailed Definition |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AbnormalAccruals | Predictor | 1_clear | 2_fair | Xie | 2001 | Abnormal Accruals | AR | continuous | Accounting | ... | 0.916666667 | 8.43 | EW | 0.1 | | 12 | 6 | | OP is aggressive and lags accounting data by o... | Define Accruals as net income (ib) minus opera... |
| 1 | Accruals | Predictor | 1_clear | 1_good | Sloan | 1996 | Accruals | AR | continuous | Accounting | ... | 0.866666667 | 4.71 | EW | 0.1 | | 12 | 6 | abs(prc)>5 | Table 6 year t+1 hedge. Only size adjusted an... | Annual change in current total assets (act) mi... |
| 2 | AccrualsBM | Predictor | 1_clear | 1_good | Bartov and Kim | 2004 | Book-to-market and accruals | RFQA | discrete | Accounting | ... | 0.206 | 5.5 | EW | 0.2 | | 12 | 6 | | | Binary variable equal to 1 if stock is in the ... |


Let's take a closer look at the `AssetGrowth` predictor. `signaldoc` has lots of info about this predictor. It provides the key table demonstrating predictability, as well as a summary of the evidence for predictability.

In [14]:
# show what signaldoc tells us about AssetGrowth
signaldoc[signaldoc["Acronym"]=="AssetGrowth"].T

| | 7 |
|---|---|
| Acronym | AssetGrowth |
| Cat.Signal | Predictor |
| Predictability in OP | 1_clear |
| Signal Rep Quality | 1_good |
| Authors | Cooper, Gulen and Schill |
| Year | 2008 |
| LongDescription | Asset growth |
| Journal | JF |
| Cat.Form | continuous |
| Cat.Data | Accounting |


# Navigating Portfolio Returns

For a given signal, there are many ways to implement portfolios. The CZ replication paper focuses on the "Original Paper" (op) implementations. These follow the "Key Table in OP" as found in `signaldoc`.

The signal doc reports an enormous t-stat of 8.5 for AssetGrowth in the original paper. Let's see how well the CZ replication matches this result, using the "op" implementation.

In [7]:
# download original paper (op) portfolios for AssetGrowth
port_op = openap.dl_port('op', 'pandas', ['AssetGrowth'])


Data is downloaded: 6s


In [None]:
# filter for long-short portfolios in sample
longshort_insamp = port_op[
    (port_op["port"]=='LS') & (port_op["date"]>='1968-01-01') & (port_op["date"]<='2003-12-31')
]

# regress ret on constant
ols = smf.ols(formula='ret ~ 1', data=longshort_insamp)

result = ols.fit()

print("t-stat is ", result.tvalues["Intercept"])


t-stat is  7.644285343661306


The t-stat of 7.6 is not quite as large as the 8.5 in the original paper, but it is in the same ballpark. 
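
Regressing `ret` on a constant amounts to a one-sample t-test on the mean return: the intercept is the sample mean, and its standard error is the sample standard deviation divided by the square root of the number of months. A minimal sketch of that arithmetic follows. The returns are made up for illustration (they are not CZ data), and `tstat_of_mean` is a hypothetical helper, not part of openassetpricing.

```python
import math
import statistics


def tstat_of_mean(returns):
    """t-stat for H0: mean return = 0, as in the intercept of `ret ~ 1`."""
    n = len(returns)
    mean = statistics.mean(returns)
    # sample standard deviation (n - 1 in the denominator) over sqrt(n)
    se = statistics.stdev(returns) / math.sqrt(n)
    return mean / se


# made-up monthly long-short returns in percent, for illustration only
toy_returns = [1.2, -0.4, 0.9, 2.1, -1.0, 0.7, 1.5, 0.3]
t = tstat_of_mean(toy_returns)
print("t-stat is ", t)
```

With default (non-robust) standard errors, this should agree with `result.tvalues["Intercept"]` from the regression when applied to the same return series.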

According to `signaldoc`, Cooper et al. used equal weighting, decile sorts, no special filters, and 12-month rebalancing.
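
These implementation details sit in `signaldoc` columns such as `Stock Weight`, `LS Quantile`, `Portfolio Period`, and `Filter`. A minimal sketch of pulling them out for one signal is below. To stay runnable offline it uses `toy_signaldoc`, a hypothetical stand-in built from the three rows shown in the `signaldoc.head(3)` output earlier.

```python
import pandas as pd

# toy stand-in for signaldoc: a few columns copied from the head(3) output above
toy_signaldoc = pd.DataFrame({
    "Acronym": ["AbnormalAccruals", "Accruals", "AccrualsBM"],
    "Stock Weight": ["EW", "EW", "EW"],
    "LS Quantile": [0.1, 0.1, 0.2],
    "Portfolio Period": [12, 12, 12],
    "Filter": [None, "abs(prc)>5", None],
})

# columns that describe how the original paper formed its portfolios
imp_cols = ["Stock Weight", "LS Quantile", "Portfolio Period", "Filter"]

# select one signal and transpose so each setting is on its own row
accruals_imp = toy_signaldoc.loc[toy_signaldoc["Acronym"] == "Accruals", imp_cols].T
print(accruals_imp)
```

On the real data, the same selection with `signaldoc` in place of `toy_signaldoc` and `"AssetGrowth"` in place of `"Accruals"` should show the settings described above.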

How does it perform under other implementations? Let's see what implementations are available in the CZ dataset.

In [112]:
openap.list_port()

┌─────────────────────────────────────────────────┬─────────────────────┐
│ CZ portfolio file                               │ Name for download   │
├─────────────────────────────────────────────────┼─────────────────────┤
│ PredictorAltPorts_Deciles.zip                   │ deciles_ew          │
│ PredictorAltPorts_DecilesVW.zip                 │ deciles_vw          │
│ PredictorAltPorts_LiqScreen_ME_gt_NYSE20pct.zip │ ex_nyse_p20_me      │
│ PredictorAltPorts_LiqScreen_NYSEonly.zip        │ nyse                │
│ PredictorAltPorts_LiqScreen_Price_gt_5.zip      │ ex_price5           │
│ PredictorAltPorts_Quintiles.zip                 │ quintiles_ew        │
│ PredictorAltPorts_QuintilesVW.zip               │ quintiles_vw        │
│ PredictorPortsFull.csv                          │ op                  │
└─────────────────────────────────────────────────┴─────────────────────┘


Several flavors of portfolio implementation are listed above. Let's check out value-weighted deciles (`deciles_vw`) and a screen that keeps only stocks with market equity above the NYSE 20th percentile (`ex_nyse_p20_me`).

In [10]:
# download alternative implementations
port_vw = openap.dl_port('deciles_vw', 'pandas', ['AssetGrowth'])
port_mescreen = openap.dl_port('ex_nyse_p20_me', 'pandas', ['AssetGrowth'])


Data is downloaded: 3s

Data is downloaded: 6s


In [15]:
# append implementations
port_all = pd.concat([
    port_op.assign(imp='op'),
    port_vw.assign(imp='deciles_vw'),
    port_mescreen.assign(imp='ex_nyse_p20_me')
])

# filter for long-short portfolios in sample
port_all = port_all[
    (port_all["port"]=='LS') & 
    (port_all["date"]>='1968-01-01') & 
    (port_all["date"]<='2003-12-31')
]   

# regress ret on constant by group
for imp in port_all["imp"].unique():
    print(imp)
    longshort_insamp = port_all[port_all["imp"]==imp]
    ols = smf.ols(formula='ret ~ 1', data=longshort_insamp)
    result = ols.fit()
    print("t-stat is ", result.tvalues["Intercept"])


op
t-stat is  7.644285343661306
deciles_vw
t-stat is  4.247514977233889
ex_nyse_p20_me
t-stat is  6.06327571282444


As expected, these liquidity adjustments lead to smaller t-stats. Value-weighting (t-stat of 4.2) is a much more severe liquidity adjustment than removing stocks below the 20th percentile of NYSE market equity (t-stat of 6.1).

You can also download all portfolios using a particular implementation by omitting the signal names.

In [20]:
# download all original paper portfolios
allport_op = openap.dl_port('op', 'pandas')

# download all decile-value-weighted portfolios
allport_vw = openap.dl_port('deciles_vw', 'pandas')


Data is downloaded: 5s

Data is downloaded: 5s


# Navigating Signal Data (a.k.a. firm-level characteristics)

To download the signal data, use `openap.dl_signal()`.


In [18]:
# Download AssetGrowth signals
signal = openap.dl_signal('pandas', ['AssetGrowth'])

signal.head()


Data is downloaded: 7s


| | permno | yyyymm | AssetGrowth |
|---|---|---|---|
| 0 | 10001 | 198712 | -0.038474 |
| 1 | 10001 | 198801 | -0.038474 |
| 2 | 10001 | 198802 | -0.038474 |
| 3 | 10001 | 198803 | -0.038474 |
| 4 | 10001 | 198804 | -0.038474 |


The first row above means that, at the end of December 1987, permno 10001 had an AssetGrowth signal (firm characteristic) of -0.038. So one can predict permno 10001's return in January 1988, or any month going forward, using this number.
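
To make the timing concrete, here is a minimal sketch of lining up each signal with the following month's return. The signal rows are copied from the output above. The returns in `ret_toy` are made up (real returns would come from CRSP), and `next_yyyymm` is a hypothetical helper rather than part of openassetpricing.

```python
import pandas as pd


def next_yyyymm(yyyymm):
    """Return the month after an integer yyyymm, e.g. 198712 -> 198801."""
    year, month = divmod(yyyymm, 100)
    return (year + 1) * 100 + 1 if month == 12 else yyyymm + 1


# signal rows copied from the output above
signal_toy = pd.DataFrame({
    "permno": [10001, 10001, 10001],
    "yyyymm": [198712, 198801, 198802],
    "AssetGrowth": [-0.038474, -0.038474, -0.038474],
})

# made-up monthly returns in percent, for illustration only
ret_toy = pd.DataFrame({
    "permno": [10001, 10001, 10001],
    "yyyymm": [198801, 198802, 198803],
    "ret": [1.0, -2.0, 0.5],
})

# shift each signal forward one month so it matches the return it predicts
lagged = signal_toy.assign(yyyymm=signal_toy["yyyymm"].map(next_yyyymm))
merged = ret_toy.merge(lagged, on=["permno", "yyyymm"], how="inner")
print(merged)
```

After the merge, the January 1988 return sits on the same row as the signal observed at the end of December 1987.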

Following Fama and French (1992), the CZ data lags annual Compustat variables by 6 months, and then uses the signal for another 12 months. One can see the timing in the GitHub code here: https://github.com/OpenSourceAP/CrossSection/blob/master/Signals/Code/DataDownloads/B_CompustatAnnual.do
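
As a rough illustration of this timing, the sketch below computes the months in which an annual signal would be in use. It assumes one reading of the rule above: the signal first becomes available 6 months after the fiscal year end and is then carried for 12 months in total. The linked Stata code is the authority on the exact convention. The fiscal year end of June 1987 and the helpers `add_months` and `availability_window` are hypothetical.

```python
def add_months(yyyymm, k):
    """Add k months to an integer yyyymm."""
    year, month = divmod(yyyymm, 100)
    total = year * 12 + (month - 1) + k
    return (total // 12) * 100 + total % 12 + 1


def availability_window(fiscal_year_end, lag=6, horizon=12):
    """Months (yyyymm) in which the signal is used, under the assumed rule."""
    first = add_months(fiscal_year_end, lag)
    return [add_months(first, i) for i in range(horizon)]


# hypothetical fiscal year ending June 1987
window = availability_window(198706)
print(window[0], window[-1])  # 198712 198811
```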

One can also download all predictor signals at once. This requires a lot of RAM, can take a few minutes, and needs a WRDS account, so it is done with a separate function (`dl_all_signals` instead of `dl_signal`).


In [23]:
# download all signals at once
allsignal = openap.dl_all_signals('pandas')

WRDS recommends setting up a .pgpass file.
You can create this file yourself at any time with the create_pgpass_file() function.
Loading library list...
Done

Data is downloaded: 3 mins


In [24]:
# show first few rows
allsignal.head()

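| | permno | yyyymm | AM | AOP | AbnormalAccruals | Accruals | AccrualsBM | Activism1 | Activism2 | AdExp | ... | sinAlgo | skew1 | std_turn | tang | zerotrade12M | zerotrade1M | zerotrade6M | Price | Size | STreversal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10000 | 198601 | | | | | | | | | ... | | | | | | | | -1.475907 | -2.778819 | -0.0 |
| 1 | 10000 | 198602 | | | | | | | | | ... | | | | | | 4.785175e-08 | | -1.178655 | -2.481568 | 0.257143 |
| 2 | 10000 | 198603 | | | | | | | | | ... | | | | | | 1.023392e-07 | | -1.490091 | -2.793004 | -0.365385 |
| 3 | 10000 | 198604 | | | | | | | | | ... | | | | | | 7.467463e-08 | | -1.386294 | -2.719452 | 0.098592 |
| 4 | 10000 | 198605 | | | | | | | | | ... | | | | | | 7.649551e-08 | | -1.134423 | -2.467581 | 0.222656 |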


One thing that pops out from downloading all signals is the large number of missing values. This is not usually an issue for single-predictor studies, as you can just drop the missing values and study the stocks that have the signal data.

But when looking at many signals, dropping stocks missing any signal means dropping the vast majority of stocks. See [Chen and McCoy (2024)](https://arxiv.org/pdf/2207.13071) and references therein for how to handle this.
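
A small sketch of why this bites is below. It uses a made-up panel with three hypothetical signals (`sigA`, `sigB`, `sigC`), not the CZ data, and only diagnoses the problem. It does not implement the methods in Chen and McCoy (2024).

```python
import numpy as np
import pandas as pd

# made-up data: five stocks, three hypothetical signals with scattered gaps
toy = pd.DataFrame({
    "permno": [1, 2, 3, 4, 5],
    "sigA": [0.1, np.nan, 0.3, 0.4, np.nan],
    "sigB": [np.nan, 0.2, 0.5, np.nan, 0.9],
    "sigC": [1.0, 2.0, np.nan, 4.0, 5.0],
})
signal_cols = ["sigA", "sigB", "sigC"]

# single-predictor study: keep stocks that have this one signal
n_single = toy.dropna(subset=["sigA"]).shape[0]

# many-signal study: keep only stocks with every signal observed
n_all = toy.dropna(subset=signal_cols).shape[0]

# share of stocks with a non-missing value, per signal
coverage = toy[signal_cols].notna().mean()

print("stocks with sigA:", n_single)
print("stocks with all signals:", n_all)
print(coverage)
```

Each signal here covers at least 60% of the stocks, yet no stock has all three. Checking the same `coverage` on `allsignal` before choosing a set of signals may help gauge how much data a complete-case filter would discard.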