## Ensemble Feature Selection Tutorial
Welcome to the Ensemble Feature Selection Tutorial! This tutorial walks through the advanced options of the ensemblefeatureselection library (imported as `ensemblefs`) for multi-objective ensemble feature selection.

## Dataset and DataProcessor module 

To illustrate our method we will use the wine dataset from sklearn.

The `DataProcessor` module turns a `pd.DataFrame`, or a CSV file directly, into a processed dataset. It can encode categorical variables, drop columns, normalize non-categorical columns, and rename the target column to `'target'`, which the pipeline requires.

In [None]:
from sklearn.datasets import load_wine
from ensemblefs.core import DataProcessor
import pandas as pd

# Load Wine dataset
wine_data = load_wine()
wine_df = pd.DataFrame(data=wine_data.data, columns=wine_data.feature_names)
wine_df['target'] = wine_data.target

# Process data using the DataProcessor module
dp = DataProcessor(
    categorical_columns=['target'],
    columns_to_drop=['nonflavanoid_phenols'],
    drop_missing_values=True,
    normalize=True,
    target_column='target'
)

processed_data = dp.preprocess_data(wine_df)
processed_data
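
For readers who want to see what these options amount to, the following is a rough plain-pandas sketch of the same steps. It is not the actual implementation of `DataProcessor`: the helper name `preprocess_sketch` is hypothetical, and both z-score scaling and integer category codes are assumptions about what `normalize=True` and `categorical_columns` do.

```python
import pandas as pd


def preprocess_sketch(df, target_column, categorical_columns=(), columns_to_drop=()):
    """Hypothetical stand-in for DataProcessor; scaling and encoding are assumptions."""
    # Drop unwanted columns, then rows with missing values.
    out = df.drop(columns=list(columns_to_drop)).dropna()
    # Rename the target column to 'target', as the pipeline expects.
    out = out.rename(columns={target_column: "target"})
    categorical = {"target" if c == target_column else c for c in categorical_columns}
    for col in out.columns:
        if col in categorical:
            # Encode categories as integer codes (assumed encoding).
            out[col] = out[col].astype("category").cat.codes
        else:
            # Z-score normalization of non-categorical columns (assumed scaling).
            out[col] = (out[col] - out[col].mean()) / out[col].std()
    return out


toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, None],
    "b": [10.0, 20.0, 30.0, 40.0],
    "unused": [0, 0, 0, 0],
    "label": ["x", "y", "x", "y"],
})
toy_processed = preprocess_sketch(
    toy, target_column="label", categorical_columns=["label"], columns_to_drop=["unused"]
)
print(toy_processed)
```

The toy frame loses its `unused` column and its last row (which has a missing value), and the `label` column comes back as integer codes under the name `target`.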



## Feature Selector and Merging Strategy 

Each feature selector and merging strategy is encapsulated within its own class, which contains specific functionality. In the simple example tutorial, we accessed these classes using string identifiers. However, for more advanced usage, users can directly instantiate feature selectors, allowing for the specification of more detailed parameters.

For example, in the code snippet below, the `RandomForestSelector` uses the sklearn random forest module. By creating an instance called `rf_selector`, users can pass advanced parameters as keyword arguments. This instance can then be passed directly to the pipeline. Merging strategies offer the same flexibility.

In [None]:
from ensemblefs.feature_selectors import RandomForestSelector, FStatisticSelector
from ensemblefs.merging_strategies import UnionOfIntersectionsMerger

# Feature Selector
rf_selector = RandomForestSelector(task='classification', num_features_to_select=5, n_estimators=1000, max_depth=5, min_samples_leaf=2)
f_selector = FStatisticSelector(task='classification', num_features_to_select=5)

# Merging Strategy
union_merger = UnionOfIntersectionsMerger()
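
The name `UnionOfIntersectionsMerger` suggests a set-based rule: intersect the subsets chosen by each pair of selectors, then take the union of those intersections. The sketch below illustrates that idea in plain Python under exactly that assumption. The function name `union_of_pairwise_intersections` is hypothetical, and the library's own merger may differ in detail.

```python
from itertools import combinations


def union_of_pairwise_intersections(subsets):
    """Keep every feature chosen by at least two selectors (assumed merging rule)."""
    merged = set()
    for first, second in combinations(subsets, 2):
        # Features on which this pair of selectors agrees.
        merged |= set(first) & set(second)
    return merged


subsets = [
    {"alcohol", "proline", "flavanoids"},       # features kept by a first selector
    {"alcohol", "color_intensity", "proline"},  # features kept by a second selector
    {"flavanoids", "hue", "ash"},               # features kept by a third selector
]
merged = union_of_pairwise_intersections(subsets)
print(sorted(merged))
```

Under this rule a feature enters the result as soon as two selectors agree on it, which is less strict than a full intersection and less permissive than a plain union.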


You can then initialize the pipeline with the defined parameters and execute it.

In the example below, we have introduced a new parameter, `min_group_size`, which ensures that each generated group within the pipeline contains a minimum of three feature selectors. This is an advanced feature; typically, the minimum size is set to two, representing the smallest possible ensemble. However, there may be cases where setting a higher minimum size is desirable.

In [None]:
from ensemblefs import FeatureSelectionPipeline

num_repeats = 5
task = "classification"
num_features_to_select = 6

pipeline = FeatureSelectionPipeline(
    data=wine_df,
    fs_methods=[rf_selector, f_selector, "mutual_info_selector", "svm_selector"],
    merging_strategy=union_merger,
    num_repeats=num_repeats,
    task=task,
    num_features_to_select=num_features_to_select,
    min_group_size=3
)

selected_features, best_repeat, best_group = pipeline.run()

print(f"The selected features are {selected_features}, from repeat {best_repeat} and group {best_group}")

Pipeline Progress: 100%|██████████| 5/5 [00:15<00:00,  3.09s/it]

The selected features are {'alcohol', 'color_intensity', 'od280/od315_of_diluted_wines', 'flavanoids', 'proline'}, from repeat 3 and group ('RandomForest', 'FStatistic')
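
To make the effect of `min_group_size` concrete, the sketch below enumerates every group of selectors of at least a given size with `itertools.combinations`. It assumes that the pipeline forms its groups as all combinations of the supplied selectors, which this tutorial does not state explicitly. The helper `enumerate_groups` and the short selector names are hypothetical.

```python
from itertools import combinations


def enumerate_groups(selector_names, min_group_size=2):
    """List all selector combinations with at least `min_group_size` members."""
    groups = []
    for size in range(min_group_size, len(selector_names) + 1):
        groups.extend(combinations(selector_names, size))
    return groups


selectors = ["RandomForest", "FStatistic", "MutualInfo", "SVM"]
groups_min2 = enumerate_groups(selectors, min_group_size=2)
groups_min3 = enumerate_groups(selectors, min_group_size=3)
print(len(groups_min2), len(groups_min3))
```

Under this assumption, raising the minimum from two to three with four selectors reduces the number of candidate groups from 11 to 5.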





### Metrics

The **performance metrics** used to assess feature subsets during the pipeline are another area for customization. A pipeline object requires a list of three implemented metrics, which can be passed as the `metrics` parameter during initialization.

Several metrics are already implemented, and you can also define your own.

#### **Implemented Classification Metrics**
1. **Log-loss**
2. **Accuracy**
3. **F1 Score**
4. **Recall**
5. **Precision**

#### **Implemented Regression Metrics**
1. **Mean Absolute Error (MAE)**
2. **Mean Squared Error (MSE)**
3. **R² Score**

#### **How to Specify Metrics**
You can provide metrics as strings or as instantiated objects. For example:

```python
from ensemblefs.metrics import RecallScore

recall = RecallScore()

pipeline = FeatureSelectionPipeline(
    data=wine_df,
    fs_methods=[rf_selector, f_selector, "mutual_info_selector", "svm_selector"],
    merging_strategy=union_merger,
    num_repeats=num_repeats,
    metrics=[recall, "precision", "accuracy"],
    task=task,
    num_features_to_select=num_features_to_select,
    min_group_size=3,
)
```

This approach allows flexibility, enabling you to combine predefined metrics with your custom implementations.
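
Accepting both strings and instances usually comes down to a small lookup step. The following sketch shows one way such a resolution could work. The registry, the `resolve_metrics` helper, and the two stand-in metric classes are all hypothetical and are not taken from `ensemblefs`.

```python
class AccuracySketch:
    """Stand-in metric: fraction of predictions equal to the true label."""
    name = "accuracy"

    def compute(self, y_true, y_pred):
        return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


class RecallSketch:
    """Stand-in metric: binary recall for a chosen positive class."""
    name = "recall"

    def __init__(self, positive=1):
        self.positive = positive

    def compute(self, y_true, y_pred):
        true_pos = sum(
            t == self.positive and p == self.positive for t, p in zip(y_true, y_pred)
        )
        actual_pos = sum(t == self.positive for t in y_true)
        return true_pos / actual_pos if actual_pos else 0.0


# Hypothetical mapping from string identifiers to metric classes.
METRIC_REGISTRY = {"accuracy": AccuracySketch, "recall": RecallSketch}


def resolve_metrics(metrics):
    """Turn a mixed list of identifiers and instances into instances only."""
    resolved = []
    for metric in metrics:
        if isinstance(metric, str):
            if metric not in METRIC_REGISTRY:
                raise ValueError(f"Unknown metric identifier: {metric!r}")
            resolved.append(METRIC_REGISTRY[metric]())
        else:
            # Already an instance: keep it, including any custom parameters.
            resolved.append(metric)
    return resolved


metrics = resolve_metrics([RecallSketch(positive=1), "accuracy"])
y_true = [1, 0, 1, 1]
y_pred = [1, 0, 0, 1]
scores = {m.name: m.compute(y_true, y_pred) for m in metrics}
print(scores)
```

Passing an instance is what lets a metric carry its own configuration, while a string falls back to the default constructor.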


## Creating a Custom Feature Selector or Merging Strategy

1. **Inherit from the Appropriate Base Class**: Your new class should inherit from either the `FeatureSelector` base class (for feature selectors) or the `MergingStrategy` base class (for merging strategies).

2. **Implement Required Methods**: 
   - For **feature selectors**, implement a method called `.compute_scores()`, which evaluates the features and returns an importance score for each one.
   - For **merging strategies**, implement a method called `.merge()`, which will handle the merging of selected features from different selectors.

3. **Instance-Based Usage**: The pipeline currently accepts custom classes only as instances, because string identifiers are not defined for them. You therefore need to instantiate your custom feature selector or merging strategy yourself and pass the instance to the pipeline.

4. **Modify Class Mapping if Necessary**: If you want to integrate your new class with a string identifier for easier usage in the future, you will need to modify the class mapping in the `utils.py` file accordingly.

This structure allows you to customize the feature selection process to meet your specific needs. For detailed guidance on how to implement these classes, please refer to the documentation or relevant tutorials.

In [None]:
from ensemblefs.feature_selectors import FeatureSelector
import numpy as np

class CustomSelector(FeatureSelector):
    name = "CustomSelector"
    def __init__(self, task, num_features_to_select=None, **kwargs):
        super().__init__(task, num_features_to_select, **kwargs)
        self.kwargs = kwargs

    def compute_scores(self, X, y):
        # Example: Return random feature importance scores for each feature
        # Replace this with the actual logic for calculating feature importance
        return np.random.rand(X.shape[1])
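
`compute_scores` only returns one score per feature, so turning those scores into a subset is a matter of ranking. The sketch below shows that ranking step in plain Python, assuming that higher scores mean more important features and that the top `num_features_to_select` are kept. `top_k_features` is a hypothetical helper, not part of the library.

```python
def top_k_features(feature_names, scores, k):
    """Return the names of the k highest-scoring features (assumed selection rule)."""
    if len(feature_names) != len(scores):
        raise ValueError("feature_names and scores must have the same length")
    ranked = sorted(zip(feature_names, scores), key=lambda pair: pair[1], reverse=True)
    return [name for name, _ in ranked[:k]]


feature_names = ["alcohol", "ash", "hue", "proline"]
scores = [0.40, 0.05, 0.15, 0.30]
selected = top_k_features(feature_names, scores, k=2)
print(selected)
```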

In [None]:
from ensemblefs.merging_strategies import MergingStrategy

class CustomMerger(MergingStrategy):
    """
    A custom merging strategy example that inherits from the MergingStrategy base class.
    This class demonstrates how to define your own merging logic.
    """
    name = "CustomMerger"

    def __init__(self, **kwargs):
        # Specify if the merging strategy is "set-based" or "rank-based"
        super().__init__("set-based")
        self.kwargs = kwargs

    def merge(self, subsets, k_features=None):
        # Example: Define custom merging logic here
        # This could be a unique way of combining feature sets or ranks
        # Return a final list of selected features based on the merging logic
        raise NotImplementedError("Replace this with your custom merging logic")
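
As one possible body for `merge`, the sketch below implements a simple majority vote with `collections.Counter`. It is written as a standalone function so that it runs without the library, and it assumes that `subsets` is a list of collections of feature names, which this tutorial does not specify. `majority_vote_merge` is a hypothetical name.

```python
from collections import Counter


def majority_vote_merge(subsets, k_features=None):
    """Keep features selected by more than half of the subsets (illustrative rule)."""
    # Count each feature at most once per subset.
    votes = Counter(feature for subset in subsets for feature in set(subset))
    threshold = len(subsets) / 2
    kept = [feature for feature, count in votes.items() if count > threshold]
    # Most-voted features first; ties are broken alphabetically.
    kept.sort(key=lambda feature: (-votes[feature], feature))
    return kept[:k_features] if k_features is not None else kept


subsets = [
    ["alcohol", "proline", "hue"],
    ["alcohol", "proline"],
    ["alcohol", "ash"],
]
merged = majority_vote_merge(subsets)
print(merged)
```

Inside a `MergingStrategy` subclass, the same logic would go in the `merge` method, with `subsets` and `k_features` taken from its arguments.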



And you're done! You can now create an instance of your custom feature selector or merging strategy and use it alongside the other methods in the pipeline.