## Momentum Feature Creation for Modalities and Indications

The momentum feature is calculated from lagged rolling returns for each modality and indication over several windows (3 to 48 months, including the 12-month window), then averaged across a company's active features.
This approach captures momentum at the indication and modality level rather than broadly across the sector or for an individual stock.
It reflects how particular modalities and indications gain traction at different times.
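
As a rough illustration of the two-step idea described above, the sketch below uses a made-up one-month snapshot. The column names mirror those used later in this notebook, but the companies, values, and the `company_momentum` helper are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical one-month snapshot: three companies, two binary features,
# and one lagged rolling return per company (all values are made up).
snapshot = pd.DataFrame({
    "COMPANY_ID": [1, 2, 3],
    "ONCOLOGY": [1, 1, 0],
    "VACCINES": [0, 1, 1],
    "rolling_return_12m": [0.02, 0.04, -0.01],
})
features = ["ONCOLOGY", "VACCINES"]

# Step 1: feature momentum = mean lagged return of the companies
# that have the feature active in this month.
feature_momentum = {
    f: snapshot.loc[snapshot[f] == 1, "rolling_return_12m"].mean()
    for f in features
}

# Step 2: company momentum = mean momentum of the company's active features.
def company_momentum(row):
    active = [feature_momentum[f] for f in features if row[f] == 1]
    return np.mean(active) if active else np.nan

snapshot["avg_momentum_12m"] = snapshot.apply(company_momentum, axis=1)
print(snapshot)
```

Company 2 has both features active, so its score is the average of the two feature-level values; a company with no active features would receive NaN.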

## Load Raw Data

In [1]:
import pandas as pd
import os

# Define base data directory
RAW_DATA_DIR = os.path.abspath("../../data/raw")

# Load datasets
returns = pd.read_csv(os.path.join(RAW_DATA_DIR, "closing_prices.csv"))
print("Returns shape: ", returns.shape)
assert returns.shape == (334817, 3)

mod_ind = pd.read_csv(os.path.join(RAW_DATA_DIR, "mod_ind.csv"))
print("mod_ind shape: ", mod_ind.shape)
assert mod_ind.shape == (1720, 30)

# NOTE:
# - Returns are calculated from PRICE_CLOSE_USD in BIOTECH_PROJECT.READ_ONLY.RETURNS, not from EXCESS_RETURN_USD_LN in BIOTECH_PROJECT.INTERNS.BIOTECH_FACTORS
# - Some companies that exist in the BIOTECH_PROJECT.INTERNS.BIOTECH_FACTORS/sage dataset were not captured in the BIOTECH_PROJECT.READ_ONLY.DESCRIPTIONS dataset
#   for these companies they are 

Returns shape:  (334817, 3)
mod_ind shape:  (1720, 30)


## Merge and Clean Raw Data

In [2]:
# Merge datasets while avoiding duplicate COMPANY_ID columns
merged_data = pd.merge(returns, mod_ind, on='COMPANY_ID', how='left')

# Convert PRICING_DATE to datetime and snap it to the end of its month (stored as MONTH_END)
merged_data['MONTH_END'] = pd.to_datetime(merged_data['PRICING_DATE']) + pd.offsets.MonthEnd(0)

# Create month and year columns
merged_data['YEAR'] = merged_data['MONTH_END'].dt.year
merged_data['MONTH'] = merged_data['MONTH_END'].dt.month

# Drop duplicates based on COMPANY_ID, YEAR, and MONTH, keeping the first occurrence
merged_data = merged_data.drop_duplicates(subset=['COMPANY_ID', 'YEAR', 'MONTH'], keep='first')

# Drop helper columns, including any unintended duplicates
merged_data = merged_data.drop(columns=['YEAR', 'MONTH', 'PRICING_DATE'], errors='ignore')
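
The month-end snapping and de-duplication above can be checked on a toy frame. This is a minimal sketch with made-up prices; it assumes the raw file could contain more than one observation per company per month, which may not be the case for `closing_prices.csv`. If it does, `keep='first'` keeps whichever row appears first in file order, so sorting by date and keeping the last row is one way to retain the latest price of the month.

```python
import pandas as pd

# Hypothetical prices: two observations in January, one in February.
prices = pd.DataFrame({
    "COMPANY_ID": [1, 1, 1],
    "PRICING_DATE": ["2024-01-15", "2024-01-31", "2024-02-29"],
    "PRICE_CLOSE_USD": [10.0, 11.0, 12.0],
})
prices["PRICING_DATE"] = pd.to_datetime(prices["PRICING_DATE"])

# MonthEnd(0) rolls a date forward to its month end and leaves
# dates that already fall on a month end unchanged.
prices["MONTH_END"] = prices["PRICING_DATE"] + pd.offsets.MonthEnd(0)

key = ["COMPANY_ID", "MONTH_END"]

# keep='first' retains the first row encountered for each company-month.
first_kept = prices.drop_duplicates(subset=key, keep="first")

# Sorting by date and keeping the last row retains the latest price instead.
last_kept = prices.sort_values("PRICING_DATE").drop_duplicates(subset=key, keep="last")

print(first_kept[["MONTH_END", "PRICE_CLOSE_USD"]])
print(last_kept[["MONTH_END", "PRICE_CLOSE_USD"]])
```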

In [3]:
print(merged_data.head(10).to_markdown())

|    |   COMPANY_ID |   PRICE_CLOSE_USD |   ONCOLOGY |   NEUROLOGY/PSYCHIATRY |   CARDIOVASCULAR/METABOLIC |   IMMUNOLOGY/AUTOIMMUNE |   INFECTIOUS DISEASES |   HEMATOLOGY |   GASTROINTESTINAL/HEPATOLOGY |   DERMATOLOGY |   OPHTHALMOLOGY |   RESPIRATORY |   UROLOGY/RENAL |   PAIN MANAGEMENT/ANESTHETICS |   PROTEIN/MONOCLONAL ANTIBODIES |   PROTEIN/OTHERS |   PEPTIDES/AMINO ACIDS WITH MORE THAN 40 RESIDUES |   PEPTIDES/AMINO ACIDS WITH FEWER THAN 40 RESIDUES (SHORT PEPTIDES) |   PEPTIDES/CYCLIC |   PEPTIDES/PEGYLATED |   PEPTIDES/OTHERS |   SMALL MOLECULES AND NATURAL PRODUCTS |   NUCLEIC ACIDS/MRNA |   NUCLEIC ACIDS/SIRNA |   NUCLEIC ACIDS/ASO (ANTISENSE OLIGONUCLEOTIDES) |   NUCLEIC ACIDS/OTHERS |   NUCLEIC ACIDS/OLIGONUCLEOTIDES (OLIGOS) |   CELL AND GENE THERAPY |   VACCINES |   IMAGING AGENTS |   COMBINATION THERAPIES | MONTH_END           |
|---:|-------------:|------------------:|-----------:|-----------------------:|---------------------------:|------------------------:|--------

## Calculate Rolling Returns Over Windows

In [6]:
# Sort the DataFrame by COMPANY_ID and MONTH_END (ascending) for proper return calculations.
merged_data = merged_data.sort_values(by=['COMPANY_ID', 'MONTH_END'], ascending=True)

# Compute the monthly return as the percentage change in PRICE_CLOSE_USD for each company.
merged_data['RETURN'] = merged_data.groupby('COMPANY_ID')['PRICE_CLOSE_USD'].pct_change()

# Define different rolling window sizes
window_sizes = [3, 6, 12, 24, 48]

# Compute the mean monthly return over each trailing window, lagged by one month
# so that the value at month t only uses returns through month t-1
for window in window_sizes:
    column_name = f'rolling_return_{window}m'
    merged_data[column_name] = (
        merged_data.groupby('COMPANY_ID')['RETURN']
        .transform(lambda x: x.rolling(window=window, min_periods=1).mean().shift(1))
    )

# Create a month-year period column (ignoring the day)
merged_data['month_year'] = merged_data['MONTH_END'].dt.to_period('M')
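
To make the lag explicit, the sketch below applies the same `rolling(...).mean().shift(1)` pattern to a made-up return series for a single company. It also shows a compounded variant for comparison; that variant is not used in this notebook and is included only as an assumption about what a cumulative window return could look like.

```python
import numpy as np
import pandas as pd

# Hypothetical monthly returns for one company.
returns = pd.Series([0.10, 0.20, 0.30, 0.40])

# Mean of the trailing 3 monthly returns, shifted one month so that
# the value at month t does not include month t's own return.
lagged_mean = returns.rolling(window=3, min_periods=1).mean().shift(1)

# Alternative (not used in the notebook): compounded 3-month return, also lagged.
compounded = ((1 + returns).rolling(window=3).apply(np.prod, raw=True) - 1).shift(1)

print(pd.DataFrame({
    "return": returns,
    "lagged_mean_3m": lagged_mean,
    "lagged_compounded_3m": compounded,
}))
```

The first row of `lagged_mean` is NaN because no earlier return exists, and with `min_periods=1` the early rows average over fewer than three months.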

In [7]:
print(merged_data.head(10).to_markdown())

|    |   COMPANY_ID |   PRICE_CLOSE_USD |   ONCOLOGY |   NEUROLOGY/PSYCHIATRY |   CARDIOVASCULAR/METABOLIC |   IMMUNOLOGY/AUTOIMMUNE |   INFECTIOUS DISEASES |   HEMATOLOGY |   GASTROINTESTINAL/HEPATOLOGY |   DERMATOLOGY |   OPHTHALMOLOGY |   RESPIRATORY |   UROLOGY/RENAL |   PAIN MANAGEMENT/ANESTHETICS |   PROTEIN/MONOCLONAL ANTIBODIES |   PROTEIN/OTHERS |   PEPTIDES/AMINO ACIDS WITH MORE THAN 40 RESIDUES |   PEPTIDES/AMINO ACIDS WITH FEWER THAN 40 RESIDUES (SHORT PEPTIDES) |   PEPTIDES/CYCLIC |   PEPTIDES/PEGYLATED |   PEPTIDES/OTHERS |   SMALL MOLECULES AND NATURAL PRODUCTS |   NUCLEIC ACIDS/MRNA |   NUCLEIC ACIDS/SIRNA |   NUCLEIC ACIDS/ASO (ANTISENSE OLIGONUCLEOTIDES) |   NUCLEIC ACIDS/OTHERS |   NUCLEIC ACIDS/OLIGONUCLEOTIDES (OLIGOS) |   CELL AND GENE THERAPY |   VACCINES |   IMAGING AGENTS |   COMBINATION THERAPIES | MONTH_END           |       RETURN |   rolling_return_3m |   rolling_return_6m |   rolling_return_12m |   rolling_return_24m |   rolling_return_48m | month_year   |

## Load Features

In [11]:
# Load the modalities and indications from the resources folder.
RESOURCES_DIR = os.path.abspath("../../resources")
with open(os.path.join(RESOURCES_DIR, "modaliities.txt"), "r") as f:
    modalities = [line.strip().upper() for line in f]
with open(os.path.join(RESOURCES_DIR, "indications.txt"), "r") as f:
    indications = [line.strip().upper() for line in f]

# Names were already upper-cased when the files were read above
mod_factors = list(modalities)
ind_factors = list(indications)
mod_ind_factors = mod_factors + ind_factors
print(f'Using {len(mod_ind_factors)} binary features: {mod_ind_factors}')

Using 29 binary features: ['PROTEIN/MONOCLONAL ANTIBODIES', 'PROTEIN/OTHERS', 'PEPTIDES/AMINO ACIDS WITH MORE THAN 40 RESIDUES', 'PEPTIDES/AMINO ACIDS WITH FEWER THAN 40 RESIDUES (SHORT PEPTIDES)', 'PEPTIDES/CYCLIC', 'PEPTIDES/PEGYLATED', 'PEPTIDES/OTHERS', 'SMALL MOLECULES AND NATURAL PRODUCTS', 'NUCLEIC ACIDS/MRNA', 'NUCLEIC ACIDS/SIRNA', 'NUCLEIC ACIDS/ASO (ANTISENSE OLIGONUCLEOTIDES)', 'NUCLEIC ACIDS/OTHERS', 'NUCLEIC ACIDS/OLIGONUCLEOTIDES (OLIGOS)', 'CELL AND GENE THERAPY', 'VACCINES', 'IMAGING AGENTS', 'COMBINATION THERAPIES', 'ONCOLOGY', 'NEUROLOGY/PSYCHIATRY', 'CARDIOVASCULAR/METABOLIC', 'IMMUNOLOGY/AUTOIMMUNE', 'INFECTIOUS DISEASES', 'HEMATOLOGY', 'GASTROINTESTINAL/HEPATOLOGY', 'DERMATOLOGY', 'OPHTHALMOLOGY', 'RESPIRATORY', 'UROLOGY/RENAL', 'PAIN MANAGEMENT/ANESTHETICS']


## Calculate Individual Feature Momentum for Each Window

In [12]:
import numpy as np

# Build a DataFrame to store momentum scores for each month-year and feature
# Rows: each unique month_year
# Columns: momentum for each feature across different rolling window sizes
momentum_records = []

# Iterate through each month_year group in the dataset
for period, group in merged_data.groupby('month_year'):
    record = {'month_year': period}  # Initialize a record for the current period
    
    # Compute momentum for each binary feature across different rolling windows
    for feature in mod_ind_factors:
        if feature in group.columns:
            feature_mask = group[feature] == 1  # Identify rows where the feature is active
            
            for window in window_sizes:
                column_name = f'rolling_return_{window}m'  # Define the rolling return column name
                
                if column_name in group.columns:
                    if feature_mask.sum() > 0:
                        # Calculate the mean rolling return for companies where the feature is active
                        momentum = group.loc[feature_mask, column_name].mean()
                    else:
                        # Assign NaN if no companies have the feature active
                        momentum = np.nan
                    
                    # Store the calculated momentum in the record
                    record[f'momentum_{feature}_{window}m'] = momentum
                else:
                    # Assign NaN if the rolling return column does not exist
                    record[f'momentum_{feature}_{window}m'] = np.nan
        else:
            # Assign NaN for missing features in the dataset
            for window in window_sizes:
                record[f'momentum_{feature}_{window}m'] = np.nan
    
    # Append the record to the list of momentum records
    momentum_records.append(record)

# Convert the list of records into a DataFrame, using month_year as the index
momentum_df = pd.DataFrame(momentum_records).set_index('month_year')

# Sort the DataFrame in descending order of month_year for better readability
momentum_df = momentum_df.sort_index(ascending=False)
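
The nested loops above could also be expressed with `groupby`. The following is a sketch on made-up data rather than a drop-in replacement: the function name `feature_momentum_table` is hypothetical, and `month_year` is represented by plain strings here instead of pandas periods to keep the example small.

```python
import numpy as np
import pandas as pd

def feature_momentum_table(df, features, return_cols, period_col="month_year"):
    """Mean lagged return per period for the companies with each feature active."""
    all_periods = pd.Index(df[period_col].unique(), name=period_col)
    frames = []
    for feature in features:
        active = df[df[feature] == 1]
        means = active.groupby(period_col)[return_cols].mean()
        # Rename 'rolling_return_3m' to 'momentum_<feature>_3m', and so on.
        means.columns = [
            f"momentum_{feature}_{col.replace('rolling_return_', '')}"
            for col in return_cols
        ]
        frames.append(means)
    # Periods with no active company for a feature end up as NaN.
    table = pd.concat(frames, axis=1).reindex(all_periods)
    return table.sort_index(ascending=False)

# Hypothetical data: two months, two features, one window.
toy = pd.DataFrame({
    "month_year": ["2024-01", "2024-01", "2024-02", "2024-02"],
    "ONCOLOGY": [1, 1, 1, 0],
    "VACCINES": [0, 1, 0, 0],
    "rolling_return_3m": [0.02, 0.04, 0.01, 0.03],
})
table = feature_momentum_table(toy, ["ONCOLOGY", "VACCINES"], ["rolling_return_3m"])
print(table)
```

In this toy data no company has `VACCINES` active in the second month, so that cell is NaN, matching the behaviour of the loop version.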

In [13]:
print(momentum_df.head(10).to_markdown())

| month_year   |   momentum_PROTEIN/MONOCLONAL ANTIBODIES_3m |   momentum_PROTEIN/MONOCLONAL ANTIBODIES_6m |   momentum_PROTEIN/MONOCLONAL ANTIBODIES_12m |   momentum_PROTEIN/MONOCLONAL ANTIBODIES_24m |   momentum_PROTEIN/MONOCLONAL ANTIBODIES_48m |   momentum_PROTEIN/OTHERS_3m |   momentum_PROTEIN/OTHERS_6m |   momentum_PROTEIN/OTHERS_12m |   momentum_PROTEIN/OTHERS_24m |   momentum_PROTEIN/OTHERS_48m |   momentum_PEPTIDES/AMINO ACIDS WITH MORE THAN 40 RESIDUES_3m |   momentum_PEPTIDES/AMINO ACIDS WITH MORE THAN 40 RESIDUES_6m |   momentum_PEPTIDES/AMINO ACIDS WITH MORE THAN 40 RESIDUES_12m |   momentum_PEPTIDES/AMINO ACIDS WITH MORE THAN 40 RESIDUES_24m |   momentum_PEPTIDES/AMINO ACIDS WITH MORE THAN 40 RESIDUES_48m |   momentum_PEPTIDES/AMINO ACIDS WITH FEWER THAN 40 RESIDUES (SHORT PEPTIDES)_3m |   momentum_PEPTIDES/AMINO ACIDS WITH FEWER THAN 40 RESIDUES (SHORT PEPTIDES)_6m |   momentum_PEPTIDES/AMINO ACIDS WITH FEWER THAN 40 RESIDUES (SHORT PEPTIDES)_12m |   momentum_PEPTIDES/AM

## Calculate Individual Company Momentum for Each Window
Each company's momentum is the average momentum of its active features at each point in time.

In [22]:
def compute_feature_momentum(row, feature_list, window):
    """
    Computes the average momentum for a given row based on active features and a specified rolling window.
    
    Parameters:
    row (pd.Series): A row from the dataset containing feature indicators.
    feature_list (list): A list of feature names to check for momentum.
    window (int): The rolling window size to use for momentum calculation.
    
    Returns:
    float: The mean momentum value for the active features, or NaN if no features are active.
    """
    period = row['month_year']
    
    # If the period is not present in momentum_df, return NaN.
    if period not in momentum_df.index:
        return np.nan
    
    # Retrieve the momentum values for this period.
    period_momentum = momentum_df.loc[period]
    momentums = []
    
    # Iterate through each feature and check if it is active.
    for feature in feature_list:
        if row.get(feature, 0) == 1:  # Check if the feature is marked as active (1)
            momentum_val = period_momentum.get(f'momentum_{feature}_{window}m', np.nan)
            if not pd.isna(momentum_val):
                momentums.append(momentum_val)
    
    # Return the average momentum for active features, or NaN if none are active.
    return np.mean(momentums) if momentums else np.nan

# Compute average momentum separately for modalities and indications across different rolling windows
for window in window_sizes:
    merged_data[f'avg_modality_momentum_{window}m'] = merged_data.apply(lambda row: compute_feature_momentum(row, mod_factors, window), axis=1)
    merged_data[f'avg_indication_momentum_{window}m'] = merged_data.apply(lambda row: compute_feature_momentum(row, ind_factors, window), axis=1)
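
Row-wise `apply` can be slow on a frame with several hundred thousand rows. One possible vectorized alternative is sketched below on made-up data. The function name `average_active_momentum` and the string periods are assumptions for illustration; the logic follows the same rule as `compute_feature_momentum` (average the non-NaN momentum of active features, NaN if there are none or the period is missing).

```python
import numpy as np
import pandas as pd

def average_active_momentum(data, momentum, features, window, period_col="month_year"):
    """Average momentum of each row's active features for one window size."""
    cols = [f"momentum_{f}_{window}m" for f in features]
    # Feature momentum aligned to each row's period (all NaN if the period is missing).
    per_row = momentum.reindex(index=data[period_col], columns=cols).to_numpy(dtype=float)
    active = data[features].to_numpy() == 1
    valid = active & ~np.isnan(per_row)
    total = np.where(valid, per_row, 0.0).sum(axis=1)
    count = valid.sum(axis=1)
    # Rows with no valid active feature get NaN instead of a division by zero.
    return pd.Series(total / np.where(count > 0, count, np.nan), index=data.index)

# Hypothetical feature-momentum table with a single period.
toy_momentum = pd.DataFrame(
    {"momentum_ONCOLOGY_12m": [0.03], "momentum_VACCINES_12m": [0.015]},
    index=pd.Index(["2024-01"], name="month_year"),
)
# Hypothetical company rows; the last one falls in a period with no momentum data.
toy_rows = pd.DataFrame({
    "month_year": ["2024-01", "2024-01", "2024-01", "2024-01", "2024-02"],
    "ONCOLOGY": [1, 1, 0, 0, 1],
    "VACCINES": [0, 1, 1, 0, 1],
})
result = average_active_momentum(toy_rows, toy_momentum, ["ONCOLOGY", "VACCINES"], 12)
print(result)
```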

## Remove Unnecessary Columns

In [23]:
# Select relevant columns, including momentum for all window sizes
momentum_columns = [f'avg_modality_momentum_{window}m' for window in window_sizes] + [f'avg_indication_momentum_{window}m' for window in window_sizes]
filtered_merged_data = merged_data[['COMPANY_ID', 'MONTH_END', 'PRICE_CLOSE_USD'] + momentum_columns]

In [24]:
print(filtered_merged_data.head(10).to_markdown())

|    |   COMPANY_ID | MONTH_END           |   PRICE_CLOSE_USD |   avg_modality_momentum_3m |   avg_modality_momentum_6m |   avg_modality_momentum_12m |   avg_modality_momentum_24m |   avg_modality_momentum_48m |   avg_indication_momentum_3m |   avg_indication_momentum_6m |   avg_indication_momentum_12m |   avg_indication_momentum_24m |   avg_indication_momentum_48m |
|---:|-------------:|:--------------------|------------------:|---------------------------:|---------------------------:|----------------------------:|----------------------------:|----------------------------:|-----------------------------:|-----------------------------:|------------------------------:|------------------------------:|------------------------------:|
|  0 |        21609 | 1968-01-31 00:00:00 |           4.18751 |               nan          |              nan           |                nan          |                nan          |                nan          |                 nan          |                  

## Output Processed Data

In [25]:
PROCESSED_DATA_DIR = os.path.abspath("../../data/processed")
os.makedirs(PROCESSED_DATA_DIR, exist_ok=True)  # Create the output directory if it does not exist

filtered_merged_data.to_csv(os.path.join(PROCESSED_DATA_DIR, "mod_ind_momentum.csv"), index=False)

In [None]:
# 2/24 Meeting Notes
# - Trend based
# - Test 1-month and 12-1 windows (big reversal)
# - One-month reversal and 12-month continuation
# - Use windows anywhere from 3 to 12 months
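
The notes above mention testing a 1-month signal and a "12 - 1" signal. One common reading of "12 - 1" is the return over the past twelve months excluding the most recent month; whether that is what the meeting intended is an assumption. The sketch below computes both on a made-up price series for a single company, and the function name `reversal_and_12_1` is hypothetical.

```python
import pandas as pd

def reversal_and_12_1(prices: pd.Series) -> pd.DataFrame:
    """prices: month-end prices for one company, sorted by date."""
    # Most recent one-month return (candidate short-term reversal signal).
    ret_1m = prices.pct_change()
    # Return from month t-12 to month t-1, skipping the latest month.
    mom_12_1 = prices.shift(1) / prices.shift(12) - 1
    return pd.DataFrame({"ret_1m": ret_1m, "mom_12_1": mom_12_1})

# Hypothetical prices rising by 1 USD per month for 13 months.
toy_prices = pd.Series(range(100, 113), dtype=float)
signals = reversal_and_12_1(toy_prices)
print(signals.tail(3))
```

Both columns use information through month t, so they would need the same one-month `shift(1)` applied elsewhere in this notebook before being used as predictors of month t returns.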