<a href="https://colab.research.google.com/github/ayotomiwaa/aml-numerai-2/blob/main/Numerai_Variance_Inflation_Factor.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# Install dependencies
!pip install -q numerapi pandas pyarrow matplotlib lightgbm scikit-learn cloudpickle lazypredict pandas-profiling statsmodels

# Inline plots
%matplotlib inline

## 1. Dataset  

At a high level, the Numerai dataset is a tabular dataset that describes the stock market over time.

Each row represents a stock at a specific point in time, where `id` is the stock ID and `era` is the date. The `features` describe attributes of the stock (e.g., P/E ratio) known on that date, and the `target` is a measure of 20-day returns.

What makes Numerai's dataset unusual is that it is `obfuscated`: the underlying stock IDs, feature names, and target definitions are anonymized. This lets us give the data out for free, and it means the data can be modeled without any financial domain knowledge (or bias!).

### Downloading the Dataset

In [None]:
# Initialize NumerAPI - the official Python API client for Numerai
from numerapi import NumerAPI
napi = NumerAPI()

# Print all files available for download in the latest dataset
[f for f in napi.list_datasets() if f.startswith("v4.2")]

In [None]:
import pandas as pd
import json

# Download the training data and feature metadata
# This will take a few minutes 🍵
napi.download_dataset("v4.2/train_int8.parquet");
napi.download_dataset("v4.2/features.json");

In [None]:
# Load only the "medium" feature set to reduce memory usage and speed up model training (required for Colab free tier)
# Use the "all" feature set to use all features
with open("v4.2/features.json") as f:
    feature_metadata = json.load(f)
feature_cols = feature_metadata["feature_sets"]["medium"]
train = pd.read_parquet("v4.2/train_int8.parquet", columns=["era"] + feature_cols + ["target"])

# Downsample to every 100th era to reduce memory usage and speed up model training (suggested for Colab free tier)
# Comment out the line below to use all the data
train = train[train["era"].isin(train["era"].unique()[::100])]
train

### Eras
As mentioned above, each `era` corresponds to a different date. Consecutive eras are exactly 1 week apart.

It is helpful to think about rows of stocks within the same `era` as a single example. You will notice that throughout this notebook and other examples, we often talk about things "per era". For example, the number of rows per era represents the number of stocks in Numerai's investable universe on that date.
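
To make the "per era" idea concrete, here is a minimal sketch on synthetic data rather than the real Numerai files. The frame `toy` and the column `feature_a` are made up for illustration; only the `era` and `target` column names mirror the dataset described above. Each era is treated as one example, so every statistic is computed once per era.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the Numerai data: 3 eras with 50 rows each.
# "feature_a" is a hypothetical feature name used only for this sketch.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "era": np.repeat(["0001", "0002", "0003"], 50),
    "feature_a": rng.integers(0, 5, size=150),
    "target": rng.choice([0.0, 0.25, 0.5, 0.75, 1.0], size=150),
})

# One value per era: how many rows (stocks) the era contains
rows_per_era = toy.groupby("era").size()

# One value per era: correlation between the feature and the target
corr_per_era = toy.groupby("era")[["feature_a", "target"]].apply(
    lambda df: df["feature_a"].corr(df["target"])
)

print(rows_per_era)
print(corr_per_era)
```

The same `groupby("era")` pattern applies to the real `train` frame once it is loaded.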

In [None]:
# Plot the number of rows per era
train.groupby("era").size().plot(title="Number of rows per era", figsize=(5, 3), xlabel="Era");

## Variance Inflation Factor
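
The variance inflation factor (VIF) of feature $j$ measures how well that feature can be predicted from the other features: $\mathrm{VIF}_j = 1 / (1 - R_j^2)$, where $R_j^2$ is the coefficient of determination from regressing feature $j$ on all remaining features. A value near 1 means the feature is nearly uncorrelated with the rest, while large values indicate multicollinearity.

The following is a minimal NumPy sketch of that definition on synthetic data, not the code used in this notebook. The helper name `vif_from_r2` is made up for illustration, and the sketch assumes an intercept is included in each regression (the "centered" VIF).

```python
import numpy as np

def vif_from_r2(X):
    """Centered VIF for each column of X, computed as 1 / (1 - R^2)."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    vifs = np.empty(k)
    for j in range(k):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress column j on an intercept plus all other columns
        A = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - resid.var() / y.var()
        vifs[j] = 1.0 / (1.0 - r2)
    return vifs

# Synthetic example: c is nearly a copy of a, b is independent
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.1 * rng.normal(size=500)
X = np.column_stack([a, b, c])

vifs = vif_from_r2(X)
print(vifs)  # a and c should get large VIFs, b a VIF close to 1
```

As far as we know, `variance_inflation_factor` in statsmodels does not add an intercept on its own, so its values can differ from this centered version unless a constant column is included in the input.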

In [None]:
# Keep only the first 5 feature columns so the VIF computation stays fast
data = train.drop(['era', 'target'], axis=1).iloc[:, :5]
# Convert int8 columns to float64
for col in data.select_dtypes(include=['int8']).columns:
    data[col] = data[col].astype('float64')

In [None]:
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# 'data' is the feature subset prepared in the previous cell
# Swap in a different DataFrame here to run the same procedure on other features

# Select only the numerical features for VIF calculation
numerical_features = data.select_dtypes(include=['float64', 'int64'])

# Threshold for VIF
vif_threshold = 5.0  # You can adjust this threshold as needed

# DataFrames to store the final results
remaining_columns = pd.DataFrame(columns=['feature', 'VIF'])
removed_columns = pd.DataFrame(columns=['feature', 'VIF'])

while True:
    # Calculating VIF for each feature
    vif_data = pd.DataFrame()
    vif_data["feature"] = numerical_features.columns
    vif_data["VIF"] = [variance_inflation_factor(numerical_features.values, i) for i in range(len(numerical_features.columns))]

    # Find max VIF
    max_vif = vif_data['VIF'].max()

    if max_vif > vif_threshold:
        # Find feature with max VIF
        max_vif_feature = vif_data[vif_data['VIF'] == max_vif]['feature'].iloc[0]

        # Add to removed columns (DataFrame.append was removed in pandas 2.0,
        # so append the row by label-based enlargement instead)
        removed_columns.loc[len(removed_columns)] = [max_vif_feature, max_vif]

        # Drop the feature with max VIF
        numerical_features = numerical_features.drop(columns=[max_vif_feature])
    else:
        # All VIFs are below the threshold
        remaining_columns = vif_data
        break

print("Remaining Columns:\n", remaining_columns)
print("\nRemoved Columns:\n", removed_columns)