# Building an Equity Universe with Fama-French 5 Factors

<a href="" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

<!-- @import "[TOC]" {cmd="toc" depthFrom=1 depthTo=6 orderedList=false} -->




## Prepare your Environment

Have a Jupyter environment ready, and `pip install` these libraries:
- numpy
- pandas
- matplotlib
- tqdm
- python-dotenv
- yfinance

You'll also need the [analysis_utils](./analysis_utils.py) module, which provides common helper functions.

In [51]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import os
from tqdm import tqdm

import dotenv
%load_ext dotenv

import warnings
warnings.filterwarnings("ignore")

IS_KAGGLE = os.getenv('IS_KAGGLE', 'True') == 'True'

if IS_KAGGLE:
    # Kaggle configs
    print('Running in Kaggle...')
    %pip install yfinance
    %pip install statsmodels
    %pip install seaborn
    %pip install scikit-learn

    for dirname, _, filenames in os.walk('/kaggle/input'):
        for filename in filenames:
            print(os.path.join(dirname, filename))
else:
    print('Running Local...')

import yfinance as yf
from analysis_utils import load_ticker_prices_ts_df, load_ticker_ts_df

os.getcwd()

The dotenv extension is already loaded. To reload it, use:
  %reload_ext dotenv
Running Local...


'c:\\Users\\adamd\\workspace\\quant_research'

# Factors

The Fama-French five-factor model explains a stock's excess return through five risk factors:


1. **Market Risk Premium (RM-RF)**: Excess return of a market portfolio over the risk-free rate.
2. **Size (SMB - Small Minus Big)**: Return difference between small and large firms.
3. **Value (HML - High Minus Low)**: Return difference between high and low book-to-market firms.
4. **Profitability (RMW - Robust Minus Weak)**: Return difference between firms with robust and weak profitability.
5. **Investment (CMA - Conservative Minus Aggressive)**: Return difference between firms with conservative and aggressive investments.

$$
\begin{align*}
R_{it} - R_{ft} &= \alpha_i + \beta_{i,MktRF}(R_{Mt} - R_{ft}) + \beta_{i,SMB}SMB_t + \beta_{i,HML}HML_t + \beta_{i,RMW}RMW_t + \beta_{i,CMA}CMA_t + \epsilon_{it} \\
\text{Where:} \\
R_{it} &\text{ is the return on asset } i \text{ at time } t, \\
R_{ft} &\text{ is the risk-free rate at time } t, \\
R_{Mt} &\text{ is the return on the market portfolio at time } t, \\
MktRF &\text{ is the market risk premium (market return minus the risk-free rate)}, \\
SMB &\text{ represents the size premium (Small Minus Big)}, \\
HML &\text{ represents the value premium (High Minus Low)}, \\
RMW &\text{ represents the profitability premium (Robust Minus Weak)}, \\
CMA &\text{ represents the investment premium (Conservative Minus Aggressive)}, \\
\alpha_i &\text{ is the intercept and } \beta_{i,\cdot} \text{ are the factor loadings of asset } i, \\
\epsilon_{it} &\text{ is the error term}.
\end{align*}
$$
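
The regression above can be estimated with ordinary least squares. The following is a minimal sketch on synthetic data, not the notebook's own estimation: the factor returns, `true_alpha`, `true_betas`, and the noise level are all made-up values chosen only to show that the fit recovers the loadings.

```python
import numpy as np

rng = np.random.default_rng(0)
n_days = 2000
factor_names = ["MktRF", "SMB", "HML", "RMW", "CMA"]

# Synthetic daily factor returns (assumed values, purely illustrative)
factors = rng.normal(0.0, 0.01, size=(n_days, 5))
true_alpha = 0.0002
true_betas = np.array([1.1, 0.4, -0.2, 0.3, 0.1])
noise = rng.normal(0.0, 0.001, size=n_days)

# Left-hand side of the model: the asset's excess return R_it - R_ft
excess_ret = true_alpha + factors @ true_betas + noise

# Design matrix: a column of ones for alpha, then the five factors
X = np.column_stack([np.ones(n_days), factors])
coef, *_ = np.linalg.lstsq(X, excess_ret, rcond=None)
alpha_hat, betas_hat = coef[0], coef[1:]

for name, beta in zip(factor_names, betas_hat):
    print(f"beta_{name}: {beta:.3f}")
print(f"alpha: {alpha_hat:.5f}")
```

With real data, `excess_ret` would be one stock's return minus the risk-free rate, and each column of `factors` one of the five factor series on the same dates.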

## Investable Universe

In [52]:
START_DATE = "2003-01-01"
END_DATE = "2023-12-31"
DATA_DIR = "data"

uni_df = rf_df = None

# Historical S&P 500 membership; each row's "tickers" string is split on whitespace
sp500_df = pd.read_csv(f"./{DATA_DIR}/S&P500_curated_historical_components.csv")
sp500_df["date"] = pd.to_datetime(sp500_df["date"])

sp500_df = sp500_df[(sp500_df["date"] >= START_DATE) & (sp500_df["date"] <= END_DATE)]

# Union of every ticker that was a constituent at any point in the window
tickers_set = set()
sp500_df["tickers"].str.split().apply(tickers_set.update)
tickers = " ".join(tickers_set)
assert len(tickers) > 0

# Make sure the cache directory exists before any pickle is written
os.makedirs(f"{DATA_DIR}/cache", exist_ok=True)

cached_file_path = f"{DATA_DIR}/cache/snp_comps.pkl"
try:
    if os.path.exists(cached_file_path):
        uni_df = pd.read_pickle(cached_file_path)
    else:
        uni_df = yf.download(tickers, start=START_DATE, end=END_DATE, keepna=True)
        uni_df.to_pickle(cached_file_path)
except FileNotFoundError as e:
    print(f"Error downloading and caching or loading file: {e}")

# ^TNX is the CBOE 10-year Treasury yield index, quoted in annualised percent
cached_file_path = f"{DATA_DIR}/cache/t10.pkl"
try:
    if os.path.exists(cached_file_path):
        rf_df = pd.read_pickle(cached_file_path)
    else:
        rf_df = yf.download("^TNX", start=START_DATE, end=END_DATE, keepna=True)[
            "Close"
        ]
        rf_df.to_pickle(cached_file_path)
except FileNotFoundError as e:
    print(f"Error downloading and caching or loading file: {e}")

uni_df, rf_df

(             Adj Close                                                       \
                      A AABA    AAL AAMRQ        AAP        AAPL        ABBV   
 Date                                                                          
 2003-01-02   11.612626  NaN    NaN   NaN  14.596296    0.224030         NaN   
 2003-01-03   11.558020  NaN    NaN   NaN  14.370791    0.225543         NaN   
 2003-01-06   12.110137  NaN    NaN   NaN  13.454131    0.225543         NaN   
 2003-01-07   12.000923  NaN    NaN   NaN  13.530277    0.224787         NaN   
 2003-01-08   11.418476  NaN    NaN   NaN  13.445341    0.220245         NaN   
 ...                ...  ...    ...   ...        ...         ...         ...   
 2023-12-22  139.334351  NaN  14.31   NaN  61.250000  193.600006  154.940002   
 2023-12-26  139.573929  NaN  14.11   NaN  60.919998  193.050003  154.619995   
 2023-12-27  139.583923  NaN  13.99   NaN  61.560001  193.149994  154.880005   
 2023-12-28  139.534012  NaN  13.98   Na
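
The load-from-cache-or-download pattern in the cell above recurs for each dataset in this notebook. One way to factor it out is sketched below. `load_or_build` and `fake_download` are hypothetical names introduced for illustration; in the notebook, the builder would wrap the `yf.download(...)` call instead.

```python
import os
import tempfile

import pandas as pd


def load_or_build(cache_path, build_fn):
    """Return the pickled DataFrame at cache_path, building and caching it if absent."""
    if os.path.exists(cache_path):
        return pd.read_pickle(cache_path)
    df = build_fn()
    # Create the cache directory on first use so to_pickle cannot fail on a missing folder
    os.makedirs(os.path.dirname(cache_path) or ".", exist_ok=True)
    df.to_pickle(cache_path)
    return df


# Stand-in builder that records how often it is called (hypothetical, no network access)
calls = []


def fake_download():
    calls.append(1)
    return pd.DataFrame({"AAA": [1.0, 2.0]})


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "cache", "demo.pkl")
    first = load_or_build(path, fake_download)   # builds the frame and writes the pickle
    second = load_or_build(path, fake_download)  # served from the cached pickle
```

Passing the builder as a callable keeps the expensive download lazy: it only runs when no cached file exists.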

## Market Risk Factor

In [53]:
t_df = t_hist_df = None
tickers = uni_df.columns.get_level_values(1).unique()
cached_file_path = f"{DATA_DIR}/cache/snp_comps_hist.pkl"
try:
    # The yf.Tickers object is reused later to read per-ticker fundamentals
    t_df = yf.Tickers(" ".join(tickers))
    if os.path.exists(cached_file_path):
        t_hist_df = pd.read_pickle(cached_file_path)
    else:
        t_hist_df = t_df.history(start=START_DATE, end=END_DATE)
        t_hist_df.to_pickle(cached_file_path)
except FileNotFoundError as e:
    print(f"Error downloading and caching or loading file: {e}")

t_hist_df

           Adj Close                                                              ...     Volume
             AABA  AAMRQ    ABC    ABI  ABKFQ   ACAS    ADS    AGN    AKS   ALXN  ...        XOM       XRAY        XRX        XTO        XYL        YUM        ZBH       ZBRA       ZION        ZTS
Date
2003-01-02    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN  ...   12798800    1403200    1717769        NaN        NaN    3378461    1762845     800550     794400        NaN
2003-01-03    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN  ...    9221900    1451600     586935        NaN        NaN    3241308    1059870     924525     602300        NaN
2003-01-06    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN  ...   11925100    1551200    1209922        NaN        NaN    2842926     841510     933750     684000        NaN
2003-01-07    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN  ...   14600300    1538000    1201004        NaN        NaN    4670978     644574     982125     560000        NaN
2003-01-08    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN  ...   12677600    1032200     869472        NaN        NaN    3513944     598018    1422675     950200        NaN
...           ...    ...    ...    ...    ...    ...    ...    ...    ...    ...  ...        ...        ...        ...        ...        ...        ...        ...        ...        ...        ...
2023-12-22    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN  ...   12921800    1261200    1223100        NaN   829300.0     991500    1337800     225900    1534500  1548400.0
2023-12-26    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN  ...   16835100    1335200    1154500        NaN   440300.0     627500    1870700     220000    1131600   814600.0
2023-12-27    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN  ...   14558800    1202800    1099000        NaN  1007700.0    1050400    1058600     275700    1345100   766400.0
2023-12-28    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN  ...   16329300    1152300    1152200        NaN   709100.0     882500     662200     193900    1125900   880100.0


In [54]:
mrk_cap = {}
ptb = {}
opm = {}
roa = {}

cached_file_path = f"{DATA_DIR}/cache/fund_snp500.pkl"
try:
    if os.path.exists(cached_file_path):
        fund_df = pd.read_pickle(cached_file_path)
    else:
        for ticker in tqdm(tickers):
            try:
                # .info is a snapshot of current fundamentals, not a point-in-time history
                ticker_info = t_df.tickers[ticker].info
                if ticker_info is None:
                    continue
                mrk_cap[ticker] = ticker_info.get("marketCap", np.nan)
                ptb[ticker] = ticker_info.get("priceToBook", np.nan)
                opm[ticker] = ticker_info.get("operatingMargins", np.nan)
                roa[ticker] = ticker_info.get("returnOnAssets", np.nan)
            except Exception as e:
                print(f"Error processing ticker {ticker}: {e}")

        fund_df = pd.DataFrame(
            {
                "MarketCap": pd.Series(mrk_cap),
                "PriceToBook": pd.Series(ptb),
                "OperatingMargins": pd.Series(opm),
                "ReturnOnAssets": pd.Series(roa),
            }
        )
        fund_df.to_pickle(cached_file_path)
except FileNotFoundError as fnf_error:
    print(f"File not found error: {fnf_error}")
except Exception as e:
    print(f"Error downloading and caching or loading file: {e}")

In [55]:
fund_df["MarketCap"]

A        4.073635e+10
AABA              NaN
AAL      8.979653e+09
AAMRQ             NaN
AAP      3.631407e+09
             ...     
YUM      3.662504e+10
ZBH      2.543299e+10
ZBRA     1.403812e+10
ZION     6.499296e+09
ZTS      9.061532e+10
Name: MarketCap, Length: 933, dtype: float64

In [56]:
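mrk_cap = fund_df["MarketCap"]

# Static capitalisation weights taken from the market-cap snapshot
mkt_weights = mrk_cap / mrk_cap.sum()

# Daily cap-weighted return; tickers with a missing return that day contribute 0
weighted_mkt_rets_df = (uni_df["Adj Close"].pct_change() * mkt_weights).sum(axis=1)
# NOTE: .mean() reduces the daily series to a single average daily return (a scalar)
mkt_df = weighted_mkt_rets_df.mean()

# NOTE: rf_df is the ^TNX yield level in annualised percent, so this difference
# mixes a decimal daily return with a percent yield and is dominated by the yield
rm_rf_df = mkt_df - rf_df

rm_rf_df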

Date
2003-01-02   -4.031194
2003-01-03   -4.037194
2003-01-05         NaN
2003-01-06   -4.065193
2003-01-07   -4.024194
                ...   
2023-12-25         NaN
2023-12-26   -3.885193
2023-12-27   -3.788193
2023-12-28   -3.849193
2023-12-29   -3.865193
Name: Close, Length: 6421, dtype: float64
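
The values near -4 above arise because a single average daily return is compared with the ^TNX yield level, which is quoted in annualised percent. A minimal sketch of a per-day excess market return is shown below on synthetic data. It keeps the market return as a daily series, renormalises the weights each day, and converts the yield to a daily decimal rate. The function name `excess_market_return`, the `trading_days` parameter, and the simple `yield / 100 / 252` conversion are assumptions for illustration, not part of the notebook.

```python
import numpy as np
import pandas as pd


def excess_market_return(prices, market_caps, rf_yield_pct, trading_days=252):
    """Daily cap-weighted market return minus a daily risk-free rate.

    prices: DataFrame of adjusted closes (dates x tickers).
    market_caps: Series of market caps indexed by ticker.
    rf_yield_pct: Series of annualised yields in percent (4.0 means 4%).
    """
    rets = prices / prices.shift(1) - 1.0
    caps = market_caps.reindex(rets.columns).fillna(0.0)
    # Weight only the tickers that have a return on that day, then renormalise per row
    weights = rets.notna().astype(float).mul(caps, axis=1)
    weights = weights.div(weights.sum(axis=1), axis=0)
    mkt = (rets.fillna(0.0) * weights).sum(axis=1, min_count=1)
    # Assumed conversion: percent -> decimal, then annual -> daily by simple division
    rf_daily = rf_yield_pct.reindex(rets.index).ffill() / 100.0 / trading_days
    return mkt - rf_daily


dates = pd.bdate_range("2023-01-02", periods=4)
prices = pd.DataFrame(
    {"AAA": [100.0, 101.0, 102.01, 102.01], "BBB": [50.0, 50.0, np.nan, 51.0]},
    index=dates,
)
market_caps = pd.Series({"AAA": 3.0, "BBB": 1.0})
rf_yield_pct = pd.Series(2.52, index=dates)  # 2.52% a year, i.e. 0.0001 a day

excess = excess_market_return(prices, market_caps, rf_yield_pct)
```

The first row is NaN because no return exists before the first price. On days where one ticker has no return, the remaining weights are rescaled to sum to one.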

## Size Risk Factor

In [57]:
# Market caps are a single snapshot, so one median split covers the whole sample
median_cap = mrk_cap.median()

small_companies = mrk_cap[mrk_cap < median_cap]
big_companies = mrk_cap[mrk_cap >= median_cap]

small_companies, big_companies


(AAL     8.979653e+09
 AAP     3.631407e+09
 ABMD    1.718065e+10
 ACV     2.174619e+08
 ADCT    1.338665e+08
             ...     
 X       1.085552e+10
 XRAY    7.540098e+09
 XRX     2.252867e+09
 ZBRA    1.403812e+10
 ZION    6.499296e+09
 Name: MarketCap, Length: 327, dtype: float64,
 A       4.073635e+10
 AAPL    2.994380e+12
 ABBV    2.736057e+11
 ABNB    8.725567e+10
 ABT     1.910881e+11
             ...     
 XOM     3.995971e+11
 XYL     2.756968e+10
 YUM     3.662504e+10
 ZBH     2.543299e+10
 ZTS     9.061532e+10
 Name: MarketCap, Length: 327, dtype: float64)
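
The split above only partitions tickers by market cap, while the factor itself is a return series. A minimal sketch of turning the two groups into an SMB series follows, using equal-weighted group returns on made-up data. This is a simplification of the sorted, value-weighted portfolios Fama and French use, and `long_short_factor` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd


def long_short_factor(returns, long_names, short_names):
    """Equal-weighted mean return of the long leg minus that of the short leg, per day."""
    long_cols = returns.columns.intersection(long_names)
    short_cols = returns.columns.intersection(short_names)
    return returns[long_cols].mean(axis=1) - returns[short_cols].mean(axis=1)


dates = pd.bdate_range("2023-01-02", periods=3)
returns = pd.DataFrame(
    {
        "S1": [0.02, -0.01, 0.00],
        "S2": [0.00, 0.01, 0.02],
        "B1": [0.01, 0.00, 0.01],
        "B2": [0.01, 0.02, -0.01],
    },
    index=dates,
)
market_caps = pd.Series({"S1": 1e9, "S2": 2e9, "B1": 5e10, "B2": 9e10})

# Same median split as in the notebook cell
median_cap = market_caps.median()
small = market_caps[market_caps < median_cap].index
big = market_caps[market_caps >= median_cap].index

# Small Minus Big: long the small group, short the big group
smb = long_short_factor(returns, small, big)
```

The same helper would apply to any of the other splits by passing the corresponding long and short ticker groups.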

## Value Risk Factor

In [58]:
ptb = fund_df["PriceToBook"]

# Price-to-book is the inverse of book-to-market: a LOW P/B is a HIGH B/M (value) stock
median_rank = ptb.median()
high_bm_companies = ptb[ptb < median_rank]
low_bm_companies = ptb[ptb >= median_rank]


low_bm_companies, high_bm_companies

(A        6.948373
 AAPL    48.168625
 ABBV    22.623358
 ABNB     9.535617
 ABT      5.098194
           ...    
 WMT      5.343162
 WST      9.084858
 WU       7.124926
 ZBRA     4.659166
 ZTS     17.859922
 Name: PriceToBook, Length: 302, dtype: float64,
 AAP     1.362429
 ACGL    1.911958
 ADI     2.770630
 ADM     1.531545
 ADT     1.921127
           ...   
 XRAY    2.233448
 XRX     0.807774
 XYL     2.773170
 ZBH     2.033077
 ZION    1.333151
 Name: PriceToBook, Length: 301, dtype: float64)
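
Fama and French sort on book-to-market using 30th and 70th percentile breakpoints rather than a median split on price-to-book. A minimal sketch of that bucketing, on made-up price-to-book values, might look like this. Dropping non-positive book values before inverting is an assumption made here to keep the ratio meaningful.

```python
import pandas as pd

# Hypothetical price-to-book values; FFF has negative book equity
price_to_book = pd.Series(
    {"AAA": 0.8, "BBB": 1.5, "CCC": 2.0, "DDD": 4.0, "EEE": 10.0, "FFF": -3.0}
)

# Book-to-market is the reciprocal of price-to-book; keep positive book values only
book_to_market = (1.0 / price_to_book[price_to_book > 0]).rename("BookToMarket")

# 30th and 70th percentile breakpoints; the middle 40% is left out of both legs
low_cut, high_cut = book_to_market.quantile([0.3, 0.7])
value = book_to_market[book_to_market >= high_cut].index.tolist()   # high B/M
growth = book_to_market[book_to_market <= low_cut].index.tolist()   # low B/M
```

HML would then be the return of the `value` group minus the return of the `growth` group.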

## Profitability Risk Factor

In [59]:
opm = fund_df["OperatingMargins"]

# Operating margin stands in for operating profitability: robust >= median, weak < median
median_rank = opm.median()
weak_companies = opm[opm < median_rank]
robust_companies = opm[opm >= median_rank]

robust_companies, weak_companies

(A       0.26836
 AAPL    0.30134
 ABBV    0.17757
 ABNB    0.44039
 ABT     0.17441
          ...   
 WU      0.19585
 XEL     0.26051
 YUM     0.36593
 ZION    0.31492
 ZTS     0.37099
 Name: OperatingMargins, Length: 327, dtype: float64,
 AAL     0.05504
 AAP    -0.01607
 ACN     0.16670
 ACV     0.00000
 ADCT   -2.58083
          ...   
 XRAY    0.08237
 XRX     0.03148
 XYL     0.10212
 ZBH     0.17205
 ZBRA    0.05021
 Name: OperatingMargins, Length: 326, dtype: float64)

## Investment Risk Factor

In [60]:
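roa = fund_df["ReturnOnAssets"]

# Return on assets is used here as a stand-in for investment:
# below-median ROA is labelled conservative, the rest aggressive
median_rank = roa.median()
cons_companies = roa[roa < median_rank]
aggro_companies = roa[roa >= median_rank]

cons_companies, aggro_companies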

(AAL     0.04601
 AAP     0.02004
 ACGL    0.04064
 ADCT   -0.17433
 ADM     0.04543
          ...   
 XRAY    0.02462
 XRX     0.02327
 XYL     0.03957
 ZBH     0.04243
 ZION    0.00954
 Name: ReturnOnAssets, Length: 324, dtype: float64,
 A       0.08253
 AAPL    0.20256
 ABBV    0.08654
 ABNB    0.07502
 ABT     0.05723
          ...   
 WU      0.06029
 XOM     0.08879
 YUM     0.24863
 ZBRA    0.06470
 ZTS     0.13760
 Name: ReturnOnAssets, Length: 325, dtype: float64)
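
Return on assets measures profitability rather than investment. In the five-factor paper, investment is the growth in total assets over the previous fiscal year, with low-growth firms labelled conservative. A minimal sketch of that sort, on made-up balance-sheet figures, is shown below. The column names `prev_year` and `last_year` are hypothetical, and the median split mirrors the notebook rather than the paper's breakpoints.

```python
import pandas as pd

# Hypothetical total assets at two consecutive fiscal year-ends
total_assets = pd.DataFrame(
    {
        "prev_year": [100.0, 200.0, 50.0, 80.0],
        "last_year": [105.0, 300.0, 49.0, 120.0],
    },
    index=["AAA", "BBB", "CCC", "DDD"],
)

# Investment proxy: year-over-year growth in total assets
asset_growth = (
    total_assets["last_year"] / total_assets["prev_year"] - 1.0
).rename("AssetGrowth")

median_growth = asset_growth.median()
conservative = asset_growth[asset_growth < median_growth].index.tolist()
aggressive = asset_growth[asset_growth >= median_growth].index.tolist()
```

CMA would then be the return of the `conservative` group minus the return of the `aggressive` group.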

# Rank Factors

# Conclusion





## References

- [Eugene F. Fama, Kenneth R. French, Common risk factors in the returns on stocks and bonds](https://www.sciencedirect.com/science/article/pii/0304405X93900235)
- [Eugene F. Fama, Kenneth R. French, A five-factor asset pricing model](https://www.sciencedirect.com/science/article/pii/S0304405X14002323)
- [yfinance GitHub](https://github.com/ranaroussi/yfinance)
- [List of S&P 500 companies](https://en.wikipedia.org/wiki/List_of_S%26P_500_companies)


## GitHub

This article is also available on [GitHub]().

A Kaggle notebook is available [here]().


## Media

All media used (code or images) is either solely owned by me, acquired through licensing, or in the public domain or licensed for use under a Creative Commons license.

## CC Licensing and Use

<a rel="license" href="http://creativecommons.org/licenses/by-nc/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc/4.0/">Creative Commons Attribution-NonCommercial 4.0 International License</a>.