## Short Interest Residualization on Market Cap

Short interest (% of float) tends to vary with market cap: small-cap stocks, for example, often carry higher short interest.
If the model does not account for market cap, an apparent short-interest signal may simply reflect a size effect rather than the predictive power of short interest itself.

Residualizing means regressing short interest on market cap and keeping only the residuals. The part of short interest that is explained by market cap is removed, so what remains is the "pure" short interest effect.
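
As a minimal sketch of the idea, the example below residualizes a synthetic short-interest series on a synthetic market cap using ordinary least squares written out with NumPy. All inputs are hypothetical and generated only for illustration (`market_cap`, `signal`, and the coefficients are assumptions, not values from this notebook's data).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000

# Hypothetical inputs (not the notebook's data): short interest is built from
# a size-driven part plus a size-independent "signal".
market_cap = rng.uniform(100, 10_000, size=n)
signal = rng.normal(scale=0.005, size=n)
short_interest = 0.05 - 3e-6 * market_cap + signal

# Design matrix with an intercept column, then a least-squares fit
X = np.column_stack([np.ones(n), market_cap])
beta, *_ = np.linalg.lstsq(X, short_interest, rcond=None)

# Residual = observed short interest minus the part explained by market cap
residual = short_interest - X @ beta

corr_raw = np.corrcoef(short_interest, market_cap)[0, 1]
corr_resid = np.corrcoef(residual, market_cap)[0, 1]
print(f"corr(raw short interest, market cap): {corr_raw:.3f}")
print(f"corr(residual, market cap):           {corr_resid:.3e}")
```

In this toy setup the raw series is strongly correlated with market cap, while the residual is uncorrelated with it up to floating-point error and stays close to the size-independent component.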

## Load Raw Data

In [13]:
import pandas as pd
import os

# Define base data directories
RAW_DATA_DIR = os.path.abspath("../../data/raw")
PROCESSED_DATA_DIR = os.path.abspath("../../data/processed")

# Load the factor file and keep only the short-interest columns
sage_factors = pd.read_csv(os.path.join(PROCESSED_DATA_DIR, "sage_factors.csv"))
# .copy() avoids a SettingWithCopyWarning when MONTH_END is reassigned below
short_interest = sage_factors[["MONTH_END", "COMPANY_ID", "SHORT_INTEREST_PCT_FLOAT"]].copy()
# Parse dates and snap each one to its calendar month end so the merge keys line up
short_interest["MONTH_END"] = pd.to_datetime(short_interest["MONTH_END"]) + pd.offsets.MonthEnd(0)
print("short_interest shape: ", short_interest.shape)
assert short_interest.shape == (120624, 3)

market_cap = pd.read_csv(os.path.join(RAW_DATA_DIR, "market_cap.csv"))
# Derive the month-end key from the daily date, then drop the daily date
market_cap["MONTH_END"] = pd.to_datetime(market_cap["DAY_DATE"]) + pd.offsets.MonthEnd(0)
market_cap = market_cap.drop(columns=["DAY_DATE"])
print("market_cap shape: ", market_cap.shape)
assert market_cap.shape == (143295, 3)

short_interest shape:  (120624, 3)
market_cap shape:  (143295, 3)


## Merge and Clean Raw Data

In [19]:
# Ensure MONTH_END is datetime in both datasets before merging.
# Direct assignment replaces the column, so the dtype becomes datetime64
# even if the column was previously stored as object.
short_interest['MONTH_END'] = pd.to_datetime(short_interest['MONTH_END'], errors='coerce')
market_cap['MONTH_END'] = pd.to_datetime(market_cap['MONTH_END'], errors='coerce')

# Now merge the datasets
merged_data = pd.merge(short_interest, market_cap, on=['COMPANY_ID', 'MONTH_END'], how='left')

# Create month and year columns
merged_data['YEAR'] = merged_data['MONTH_END'].dt.year
merged_data['MONTH'] = merged_data['MONTH_END'].dt.month

# Drop duplicates based on COMPANY_ID, YEAR, and MONTH, keeping the first occurrence
merged_data = merged_data.drop_duplicates(subset=['COMPANY_ID', 'YEAR', 'MONTH'], keep='first')

# Drop helper columns; PRICING_DATE is removed only if present (errors='ignore')
merged_data = merged_data.drop(columns=['YEAR', 'MONTH', 'PRICING_DATE'], errors='ignore')

# Drop rows with missing values in the columns of interest
merged_data = merged_data.dropna(subset=['SHORT_INTEREST_PCT_FLOAT', 'MARKET_CAP_USD'])
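
Before relying on the de-duplication and `dropna` steps above, it can help to see how many rows they are likely to affect. The sketch below uses two small hypothetical frames (`si` and `mc`, stand-ins for `short_interest` and `market_cap`) to count duplicate merge keys on the market-cap side and short-interest rows that find no market cap, using the `indicator=True` option of `pd.merge`.

```python
import pandas as pd

# Hypothetical toy frames standing in for short_interest and market_cap
si = pd.DataFrame({
    "COMPANY_ID": [1, 1, 2],
    "MONTH_END": pd.to_datetime(["2005-01-31", "2005-02-28", "2005-01-31"]),
    "SHORT_INTEREST_PCT_FLOAT": [0.01, 0.02, 0.03],
})
mc = pd.DataFrame({
    "COMPANY_ID": [1, 1, 1],
    "MONTH_END": pd.to_datetime(["2005-01-31", "2005-01-31", "2005-02-28"]),
    "MARKET_CAP_USD": [100.0, 101.0, 110.0],
})

keys = ["COMPANY_ID", "MONTH_END"]

# Keys that repeat on the right side multiply rows in a left merge
n_dup_keys = int(mc.duplicated(subset=keys).sum())

# indicator=True adds a _merge column; "left_only" marks rows with no match
check = pd.merge(si, mc, on=keys, how="left", indicator=True)
n_unmatched = int((check["_merge"] == "left_only").sum())

print(f"duplicate market-cap keys: {n_dup_keys}")
print(f"short-interest rows without market cap: {n_unmatched}")
print(f"rows after merge: {len(check)} (from {len(si)} short-interest rows)")
```

Running the same two counts on the real frames would show how much of the row loss comes from duplicated month-end keys versus missing market caps.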

In [21]:
print(merged_data.head().to_markdown())

|     | MONTH_END           |   COMPANY_ID |   SHORT_INTEREST_PCT_FLOAT |   MARKET_CAP_USD |
|----:|:--------------------|-------------:|---------------------------:|-----------------:|
| 204 | 2005-02-28 00:00:00 |        24153 |                0.000397153 |          269.641 |
| 205 | 2005-03-31 00:00:00 |        24153 |                0.000444213 |          219.146 |
| 206 | 2005-04-30 00:00:00 |        24153 |                0.000425021 |          201.978 |
| 207 | 2005-05-31 00:00:00 |        24153 |                0.000329848 |          267.687 |
| 208 | 2005-06-30 00:00:00 |        24153 |                0.000313068 |          318.578 |


In [23]:
import statsmodels.api as sm
import pandas as pd

# Ensure numeric types (convert non-numeric values to NaN)
merged_data['MARKET_CAP_USD'] = pd.to_numeric(merged_data['MARKET_CAP_USD'], errors='coerce')
merged_data['SHORT_INTEREST_PCT_FLOAT'] = pd.to_numeric(merged_data['SHORT_INTEREST_PCT_FLOAT'], errors='coerce')

# Drop rows where either column is NaN after conversion
merged_data = merged_data.dropna(subset=['MARKET_CAP_USD', 'SHORT_INTEREST_PCT_FLOAT'])

# Define independent (X) and dependent (y) variables
X = merged_data['MARKET_CAP_USD']
y = merged_data['SHORT_INTEREST_PCT_FLOAT']

# Add a constant to X for intercept
X = sm.add_constant(X)

# Fit the linear regression model
model = sm.OLS(y, X, missing='drop').fit()

# Store residuals in a new column
merged_data['SHORT_INTEREST_RESIDUALIZED'] = model.resid

In [25]:
filtered_merged_data = merged_data[['MONTH_END', 'COMPANY_ID', 'SHORT_INTEREST_RESIDUALIZED']]

In [26]:
PROCESSED_DATA_DIR = os.path.abspath("../../data/processed")

filtered_merged_data.to_csv(os.path.join(PROCESSED_DATA_DIR, "residualized_short_interest.csv"), index=False)
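
When the saved file is read back, `MONTH_END` comes in as plain strings unless the dates are parsed explicitly. The following is a small round-trip sketch on a hypothetical two-row frame written to a temporary directory (the residual values are made up for illustration), assuming downstream code wants `MONTH_END` as datetimes.

```python
import os
import tempfile

import pandas as pd

# Hypothetical output frame with the same columns as filtered_merged_data
out = pd.DataFrame({
    "MONTH_END": pd.to_datetime(["2005-02-28", "2005-03-31"]),
    "COMPANY_ID": [24153, 24153],
    "SHORT_INTEREST_RESIDUALIZED": [-0.0012, 0.0008],  # made-up values
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "residualized_short_interest.csv")
    out.to_csv(path, index=False)
    # parse_dates restores the datetime dtype that CSV storage discards
    reloaded = pd.read_csv(path, parse_dates=["MONTH_END"])

same_shape = reloaded.shape == out.shape
is_datetime = pd.api.types.is_datetime64_any_dtype(reloaded["MONTH_END"])
print(f"same shape: {same_shape}, MONTH_END is datetime: {is_datetime}")
```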