---
# Majority Vote Feature Importance Tutorial
---

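This tutorial walks through the `FeatureImportanceClassification` and `FeatureImportanceRegression` classes from `src.feature_importance`. They fit several estimators, compute a feature importance score per estimator, and combine the per-estimator results into a majority vote for each feature.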




### Libraries

In [1]:
import os
import git
import logging

# Models
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestClassifier
from lightgbm import LGBMClassifier, LGBMRegressor

# Get Git Root Directory
DIR_ROOT = git.Repo(search_parent_directories=True).working_tree_dir
os.chdir(DIR_ROOT)

# Project Modules
from src.feature_importance import (
    FeatureImportanceClassification,
    FeatureImportanceRegression,
)

In [2]:
### Settings

In [3]:
# Show INFO-level messages, including those emitted by src.feature_importance
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)

In [4]:
## Feature Importance - Classification

In [5]:
### i. Example using synthetically generated data

In [6]:
f_imp_clf = FeatureImportanceClassification(
    num_samples_synthetic=1_000,
    plot_importance=False,
    generate_synthetic_data=True,
    estimators=(
        ("RandomForestClassifier", RandomForestClassifier()),
        ("LGBMClassifier", LGBMClassifier()),
    ),
)

INFO:src.feature_importance:Class object FeatureImportanceClassification instantiated successfully


In [7]:
feature_importance_df = f_imp_clf.fit_transform()

INFO:src.feature_importance:Fitting and transforming the model.
INFO:src.feature_importance:Fitting the model.
INFO:src.feature_importance:Generate synthetic data with num-samples {self.num_samples_synthetic}
INFO:src.feature_importance:	 Generated Data shape: (1000, 20), Target shape: (1000,), Labels Counter({0: 504, 1: 496})
INFO:src.feature_importance:Generating the train test split.
INFO:src.feature_importance:Building the categorical transformer.
INFO:src.feature_importance:Building the numeric transformer.
INFO:src.feature_importance:Building the column transformer.
INFO:src.feature_importance:Building estimator pipelines.
INFO:src.feature_importance:		 Adding Estimator RandomForestClassifier
INFO:src.feature_importance:		 Adding Estimator LGBMClassifier
INFO:src.feature_importance:Transforming the data.
INFO:src.feature_importance:Building Estimators.
INFO:src.feature_importance:	 Fitting Estimator RandomForestClassifier.
INFO:src.feature_importance:	 Generating feature importance for RandomForestClassifier



INFO:src.feature_importance:Joining feature importance data.
INFO:src.feature_importance:Median value for RandomForestClassifier is 0.006756068767223521
INFO:src.feature_importance:Calculating total votes.
INFO:src.feature_importance:Calculating majority vote.
INFO:src.feature_importance:	 Fitting Estimator LGBMClassifier.


[LightGBM] [Info] Number of positive: 332, number of negative: 338
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000354 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 4473
[LightGBM] [Info] Number of data points in the train set: 670, number of used features: 20
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.495522 -> initscore=-0.017911
[LightGBM] [Info] Start training from score -0.017911


INFO:src.feature_importance:	 Generating feature importance for LGBMClassifier




INFO:src.feature_importance:Joining feature importance data.
INFO:src.feature_importance:Median value for RandomForestClassifier is 0.006756068767223521
INFO:src.feature_importance:Median value for LGBMClassifier is 0.2207807269067867
INFO:src.feature_importance:Calculating total votes.
INFO:src.feature_importance:Calculating majority vote.
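
The log above names the pipeline stages but not their implementation. As a rough illustration only, the sketch below walks through comparable steps with plain scikit-learn: synthetic data, a train/test split, one fitted estimator, and a median threshold on its importances. The use of `make_classification`, the 330-row test split (chosen so that 670 training rows remain, as in the LightGBM log), the impurity-based `feature_importances_` attribute, and the `random_state` values are all assumptions; `src.feature_importance` may compute importances differently.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Assumption: synthetic binary data with 1,000 rows and 20 features,
# matching the shapes reported in the log.
X, y = make_classification(n_samples=1_000, n_features=20, random_state=0)

# Assumption: hold out 330 rows so that 670 remain for training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=330, random_state=0
)

# Fit one estimator and read its impurity-based importances
# (an assumption; the tutorial's class may use another importance measure).
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
importances = model.feature_importances_

# A feature is flagged when its importance exceeds the estimator's median.
median = np.median(importances)
is_significant = (importances > median).astype(int)

print(f"Median importance: {median}")
print(f"Features above the median: {is_significant.sum()} of {len(importances)}")
```

Repeating the last step for every estimator gives one 0/1 column per model, which is the input to the vote count.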


In [8]:
feature_importance_df

| | RandomForestClassifier | LGBMClassifier | RandomForestClassifier_IS_SIGNIFICANT | LGBMClassifier_IS_SIGNIFICANT | TOTAL_VOTES | IS_MAJORITY |
|---|---|---|---|---|---|---|
| feature_0 | 0.005145 | 0.164691 | 0 | 0 | 0 | 0 |
| feature_1 | 0.011916 | 0.228599 | 1 | 1 | 2 | 1 |
| feature_2 | 0.004775 | 0.223197 | 0 | 1 | 1 | 1 |
| feature_3 | 0.010547 | 0.298001 | 1 | 1 | 2 | 1 |
| feature_4 | 0.004225 | 0.121718 | 0 | 0 | 0 | 0 |
| feature_5 | 0.010921 | 0.331867 | 1 | 1 | 2 | 1 |
| feature_6 | 0.004567 | 0.206041 | 0 | 0 | 0 | 0 |
| feature_7 | 0.007376 | 0.218364 | 1 | 0 | 1 | 1 |
| feature_8 | 0.108506 | 0.924843 | 1 | 1 | 2 | 1 |
| feature_9 | 0.159713 | 3.846352 | 1 | 1 | 2 | 1 |
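
The vote columns in this output can be reproduced from the importance columns and the medians reported in the log. The sketch below is an inference from the displayed values, not the library's code: it assumes a feature is significant for an estimator when its importance is strictly greater than that estimator's median, and that a feature wins the majority vote when at least half of the estimators flag it. With only two estimators, the displayed rows cannot distinguish this rule from other thresholds that treat one vote as enough.

```python
import pandas as pd

# Importances copied from the ten rows displayed above.
importances = pd.DataFrame(
    {
        "RandomForestClassifier": [
            0.005145, 0.011916, 0.004775, 0.010547, 0.004225,
            0.010921, 0.004567, 0.007376, 0.108506, 0.159713,
        ],
        "LGBMClassifier": [
            0.164691, 0.228599, 0.223197, 0.298001, 0.121718,
            0.331867, 0.206041, 0.218364, 0.924843, 3.846352,
        ],
    },
    index=[f"feature_{i}" for i in range(10)],
)

# Medians taken from the log; they were computed over all 20 features,
# so they are not recomputed from the ten rows shown here.
medians = pd.Series(
    {
        "RandomForestClassifier": 0.006756068767223521,
        "LGBMClassifier": 0.2207807269067867,
    }
)

# Assumed rule: one vote per estimator whose importance exceeds its median.
significant = importances.gt(medians, axis="columns").astype(int)
total_votes = significant.sum(axis=1)

# Assumed rule: majority means at least half of the estimators voted yes.
is_majority = (total_votes >= len(medians) / 2).astype(int)

result = importances.join(significant.add_suffix("_IS_SIGNIFICANT")).assign(
    TOTAL_VOTES=total_votes, IS_MAJORITY=is_majority
)
print(result)
```

Under these assumptions, `feature_2` and `feature_7` each receive a single vote and still count as a majority, which matches the table.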
