Implementation of 101 formulaic Alphas from paper https://arxiv.org/ftp/arxiv/papers/1601/1601.00991.pdf
Zura Kakushadze:

"""
We emphasize that the 101 alphas we present here are not “toy” alphas but real-life trading
alphas used in production. In fact, 80 of these alphas are in production as of this writing.22 To
our knowledge, this is the first time such a large number of real-life explicit formulaic alphas
appear in the literature. This should come as no surprise: naturally, quant trading is highly
proprietary and secretive. Our goal here is to provide a glimpse into the complex world of
modern and ever-evolving quantitative trading and help demystify it, to any degree pos

"""sFunction	Definition
ts_{O}(x, d)	Operator O applied to the time-series for the past d days; non-integer number of days d is converted to floor(d)
ts_lag(x, d)	Value of x d days ago
ts_delta(x, d)	Difference between the value of x today and d days ago
ts_weighted_mean(x, d)	Weighted moving average over the past d days with linearly decaying weights d, d – 1, …, 1 (rescaled to sum up to 1)
ts_sum(x, d)	Rolling sum over the past d days
ts_product(x, d)	Rolling product over the past d days
ts_stddev(x, d)	Moving standard deviation over the past d days
ts_rank(x, d)	Rank over the past d days
ts_min(x, d)	Rolling min over the past d days [alias: min(x, d)]
ts_max(x, d)	Rolling max over the past d days [alias: max(x, d)]
ts_argmax(x, d)	Day of ts_max(x, d)
ts_argmin(x, d)	Day of ts_min(x, d)
ts_correlation(x, y, d)	Correlation of x and y for the past d days
ts_covariance(x, y, d)	Covariance of x and y for the past d days

"""
Input Data
returns = daily close-to-close returns
open, close, high, low, volume = standard definitions for daily price and volume data
vwap = daily volume-weighted average price
cap = market cap
adv{d} = average daily dollar volume for the past d days
IndClass = a generic placeholder for a binary industry classification such as GICS, BICS, NAICS,
SIC, etc., in indneutralize(x, IndClass.level), where level = sector, industry, subindustry, etc.
Multiple IndClass in the same alpha need not correspond to the same industry classifica

https://github.com/stefan-jansen/machine-learning-for-trading/blob/main/data/create_datasets.ipynbtion
"""ible


In [2]:
####  Imports & Settings
import warnings
warnings.filterwarnings('ignore')
####
import os
from pathlib import Path
import requests
from io import BytesIO
from zipfile import ZipFile, BadZipFile

import numpy as np
import pandas as pd
import pandas_datareader.data as web
from sklearn.datasets import fetch_openml

pd.set_option('display.expand_frame_repr', False)
"""
                Set Data Store path
"""

DATA_STORE = Path('C:/Users/pythonProject/MyJupyterNotebooks/data/wiki_prices/assets.h5')
###################################################  Define a function to clean the column names##################################
# Function to clean column names
def clean_column_name(column_name):
    # Remove leading and trailing whitespaces
    cleaned_name = column_name.strip()
    
    # Remove spaces between words
    cleaned_name = cleaned_name.replace(' ', '')
    
    # Convert to lowercase
    cleaned_name = cleaned_name.lower()
    
    # Remove special characters
    cleaned_name = ''.join(e for e in cleaned_name if e.isalnum())
    
    return cleaned_name
###################################################    Quandl Wiki Prices    ###################################################

print("###################################################    Quandl Wiki stock Prices    ################")
# Use pathlib for cleaner path manipulation
fpath = Path('C:/Users/pythonProject/MyJupyterNotebooks/data/wiki_prices/WIKI_PRICES_212b326a081eacca455e13140d7bb9db.csv')
df1 = (pd.read_csv(fpath,parse_dates=['date'],
                   index_col=['date', 'ticker'],
                   infer_datetime_format=True)
       .sort_index())

#df1.set_index(['date', 'ticker'], inplace=True)
print(df1.info())
print(df1.head())
with pd.HDFStore(DATA_STORE) as store:
    store.put('quandl/wiki/stockprices', df1)

print(" dataframe#1 :: written to Hstore: quandl/wiki/stockprices ===> price data downloaded from Nasdaq ")
print("###################################################    Quandl Wiki stock Prices    ################")


###################################################    Quandl Wiki stock Prices    ################
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 15389314 entries, (Timestamp('1962-01-02 00:00:00'), 'ARNC') to (Timestamp('2018-03-27 00:00:00'), 'ZUMZ')
Data columns (total 12 columns):
 #   Column       Dtype  
---  ------       -----  
 0   open         float64
 1   high         float64
 2   low          float64
 3   close        float64
 4   volume       float64
 5   ex-dividend  float64
 6   split_ratio  float64
 7   adj_open     float64
 8   adj_high     float64
 9   adj_low      float64
 10  adj_close    float64
 11  adj_volume   float64
dtypes: float64(12)
memory usage: 1.4+ GB
None
                     open    high     low   close   volume  ex-dividend  split_ratio  adj_open  adj_high   adj_low  adj_close  adj_volume
date       ticker                                                                                                                        
1962-01-02 ARNC     65.

In [3]:
import warnings
warnings.filterwarnings('ignore')
####
import os
from pathlib import Path
import requests
from io import BytesIO
from zipfile import ZipFile, BadZipFile

import numpy as np
import pandas as pd
import pandas_datareader.data as web
from sklearn.datasets import fetch_openml

pd.set_option('display.expand_frame_repr', False)
"""
                Set Data Store path
"""

DATA_STORE = Path('C:/Users/pythonProject/MyJupyterNotebooks/data/wiki_prices/assets.h5')

print("########################    Metadata on US-traded companies    ############################")
# Function to clean column names

def clean_column_name(column_name):
    # Remove leading and trailing whitespaces
    cleaned_name = column_name.strip()
    
    # Remove spaces between words
    cleaned_name = cleaned_name.replace(' ', '')
    
    # Convert to lowercase
    cleaned_name = cleaned_name.lower()
    
    # Remove special characters
    cleaned_name = ''.join(e for e in cleaned_name if e.isalnum())
    
    return cleaned_name


exchanges = ['NASDAQ', 'AMEX', 'NYSE']
fpath = Path('C:/Users/pythonProject/MyJupyterNotebooks/data/us_traded_cos')
all_files = fpath.glob('*.csv')  # Get all CSV files in the folder
df = pd.DataFrame()  # Initialize an empty DataFrame


for filename in all_files:
    # Read each CSV file into a DataFrame
    df_temp = pd.read_csv(filename)

    # Drop columns with all NA values
    df_temp.dropna(how='all', axis=1, inplace=True)

    # Concat the temporary DataFrame to the main DataFrame
    df5 = pd.concat([df, df_temp], ignore_index=True)
    

df5.columns = [clean_column_name(name) for name in df5.columns]
df5.rename(columns={'symbol': 'ticker'}, inplace=True)
df5 = df5.rename(columns=str.lower).set_index('ticker')
# remove duplicate rows from a pandas DataFrame based on the index
df5 = df5[~df5.index.duplicated()]

with pd.HDFStore(DATA_STORE) as store:
    store.put('us_equities/stocks',df5)


print(df5.info())
print(df5.head())
print(" dataframe#5 :: written to Hstore: us_equities/ustradedcos ===> Metadata on US-traded companies ")
print("########################    Metadata on US-traded companies    ############################")

########################    Metadata on US-traded companies    ############################
<class 'pandas.core.frame.DataFrame'>
Index: 2856 entries, A to ZWS
Data columns (total 10 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   name       2856 non-null   object 
 1   lastsale   2856 non-null   object 
 2   netchange  2856 non-null   float64
 3   change     2856 non-null   object 
 4   marketcap  2460 non-null   float64
 5   country    2601 non-null   object 
 6   ipoyear    1516 non-null   float64
 7   volume     2856 non-null   int64  
 8   sector     2382 non-null   object 
 9   industry   2382 non-null   object 
dtypes: float64(3), int64(1), object(6)
memory usage: 310.0+ KB
None
                                                     name lastsale  netchange   change     marketcap        country  ipoyear   volume       sector                                          industry
ticker                                                       