# Getting Started with Ensemble Feature Selection

Welcome to the Ensemble Feature Selection library! In this notebook, we will provide you with a guided tour
of the library's features and functionalities. You will learn how to set up the pipeline, define key parameters,
and utilize various methods for effective feature selection through a multi-objective optimization approach.

## Project Structure Overview
The project structure is organized as follows:
- **core**: Contains core functionality including data processing and metrics calculation.
- **utils**: Utility functions shared across the library.
- **feature_selectors**: Includes various feature selection methods like f_statistic, mutual_info, and others.
- **merging_strategies**: Houses methods for merging feature selection results.
- **feature_selection_pipeline**: Main class that can execute the whole pipeline through its `.run()` method.

## Setup and Installation
To set up your environment, install the library with `pip install ensemblefs`, or install it from source by running `pip install .` in the repository root.

```python
# !pip install ensemblefs
```

## The FeatureSelectionPipeline Class

The `FeatureSelectionPipeline` class is the core component of the project. It orchestrates the feature selection process based on the configuration you provide.

### Basic Usage
To use the pipeline, define the necessary parameters and call the `.run()` method. All the underlying logic is encapsulated within this method. For a deeper understanding of the pipeline's operations, please refer to the documentation.

### How It Works
1. Group Creation: The pipeline creates 𝑁 groups from the feature selection methods you have selected.
2. Feature Selection: Based on the merging strategy, the desired number of features to select, and other parameters, it performs feature selection for each of these groups over 𝑀 repeats.
3. Metrics: The pipeline computes performance and stability metrics for each group on the different folds of the data.
4. Multi-Objective Optimization: Using a Pareto-based method, it returns the selected features.

Together, these steps let you combine the strengths of multiple feature selection methods in a single framework.
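
Steps 3 and 4 can be pictured with the following minimal sketch. It is not the library's implementation: the Jaccard-based stability score, the hypothetical `runs` data, and the function names are all assumptions chosen for illustration. The idea is that each group gets a (performance, stability) pair, and only groups that no other group beats on both objectives are kept.

```python
from itertools import combinations
from statistics import mean


def jaccard(a, b):
    """Jaccard similarity between two feature sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def stability(selections):
    """Mean pairwise Jaccard similarity across the feature sets of all repeats.

    Assumption: this is an illustrative stand-in, not the library's metric.
    """
    pairs = list(combinations(selections, 2))
    return mean(jaccard(a, b) for a, b in pairs) if pairs else 1.0


def pareto_front(scores):
    """Return the group names whose (performance, stability) pair is not dominated.

    A group is dominated when another group is at least as good on both
    objectives and strictly better on at least one (higher is better).
    """
    front = []
    for name, (perf, stab) in scores.items():
        dominated = any(
            o_perf >= perf and o_stab >= stab and (o_perf > perf or o_stab > stab)
            for other, (o_perf, o_stab) in scores.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return front


# Hypothetical results: a mean performance score and the features chosen per repeat.
runs = {
    "A": {"perf": 0.80, "selections": [{"x1", "x2"}, {"x1", "x2"}]},
    "B": {"perf": 0.85, "selections": [{"x1", "x2"}, {"x1", "x3"}]},
    "C": {"perf": 0.75, "selections": [{"x1", "x2"}, {"x3", "x4"}]},
}
scores = {name: (r["perf"], stability(r["selections"])) for name, r in runs.items()}
print(pareto_front(scores))
```

In this toy data, group C is worse than group A on both objectives, so it drops out, while A (more stable) and B (more accurate) both remain as trade-offs.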

## Pipeline Configuration Parameters

Configuration parameters control how the pipeline behaves.

The required parameters are:

- **data**: Dataset (pandas.DataFrame)
- **fs_methods**: List of feature selection methods.
- **merging_strategy**: Strategy to merge results from different selectors.
- **num_repeats**: Number of times the feature selection is repeated.
- **num_features_to_select**: Number of features to select.
- **task**: Type of learning task, for example `"classification"`.
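
To give a feel for what a merging strategy does, here is a small sketch that reads the identifier `union_of_intersections_merger` (used later in this notebook) literally: intersect the feature sets of every pair of selectors, then take the union of those intersections. This interpretation, the function name, and the feature names are assumptions; the library's merger may rank or truncate features differently.

```python
from itertools import combinations


def union_of_intersections(selections):
    """Union of all pairwise intersections of the given feature sets.

    In effect, a feature is kept when at least two selectors chose it.
    """
    merged = set()
    for a, b in combinations(selections, 2):
        merged |= set(a) & set(b)
    return merged


# Hypothetical output of three selectors on a made-up dataset.
selected = {
    "f_statistic_selector": {"age", "income", "height"},
    "random_forest_selector": {"age", "weight", "height"},
    "mutual_info_selector": {"income", "weight", "city"},
}
merged = union_of_intersections(selected.values())
print(sorted(merged))
```

Here `city` is dropped because only one selector picked it, while every other feature is backed by at least two selectors.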

The remaining parameters are:

- **min_group_size**: Minimum number of methods in each group (each ensemble).
- **random_state**: Random seed.
- **n_jobs**: Number of parallel jobs for the scikit-learn methods that accept this parameter.
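
The effect of `min_group_size` on group creation can be sketched as follows, assuming the groups are every combination of the chosen methods with at least `min_group_size` members. Both that assumption and the helper `enumerate_groups` are illustrative, not taken from the library.

```python
from itertools import combinations


def enumerate_groups(fs_methods, min_group_size=2):
    """List every combination of methods containing at least min_group_size members."""
    groups = []
    for size in range(min_group_size, len(fs_methods) + 1):
        groups.extend(combinations(fs_methods, size))
    return groups


fs_methods = [
    "f_statistic_selector",
    "random_forest_selector",
    "mutual_info_selector",
]
groups = enumerate_groups(fs_methods, min_group_size=2)
for group in groups:
    print(group)
```

With three methods and a minimum size of 2, this yields three pairs plus the full triple. The number of groups grows quickly with the number of methods, so raising `min_group_size` is one way to keep the run time manageable.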

## DataProcessor Module

The `DataProcessor` module facilitates the transition from raw data to a well-structured pandas DataFrame suitable for use in the pipeline. With customizable parameters, it allows you to handle categorical variables, drop unnecessary columns, manage missing values, and normalize numerical features. By configuring the processor according to your dataset's needs, you ensure optimal performance of the feature selection pipeline.

Example:

```python
import pandas as pd

from ensemblefs.core import DataProcessor

# Configure how the raw data is cleaned before feature selection
dp = DataProcessor(
    categorical_columns=['category1', 'category2'],
    columns_to_drop=['unwanted_column'],
    drop_missing_values=True,
    merge_key='id',
    normalize=True,
    target_column='target'
)

# Load the raw dataset and apply the preprocessing steps
data = pd.read_csv('your_dataset.csv')
processed_data = dp.preprocess(data)
```

## Pipeline Example

To demonstrate the usage of the `FeatureSelectionPipeline`, we can set up the feature selection methods and configuration parameters as follows:

```python
from ensemblefs import FeatureSelectionPipeline

# Define feature selection methods
fs_methods = [
    "f_statistic_selector",
    "random_forest_selector",
    "mutual_info_selector",
]

# Configuration parameters
merging_strategy = "union_of_intersections_merger"
num_repeats = 5
metrics = ["logloss", "f1_score", "accuracy"]
task = "classification"
random_state = 2024
num_features_to_select = 6
n_jobs = 1

# Initialize the pipeline
pipeline = FeatureSelectionPipeline(
    data=processed_data,
    fs_methods=fs_methods,
    merging_strategy=merging_strategy,
    num_repeats=num_repeats,
    num_features_to_select=num_features_to_select,
    metrics=metrics,
    task=task,
    random_state=random_state,
    n_jobs=n_jobs
)

# Run the pipeline
results = pipeline.run()
```

Instead of referring to feature selection methods and merging strategies by their string identifiers, you can also pass instances of the corresponding classes, which lets you tune them through their own parameters. For more information, see the advanced example tutorial.

## Pipeline with script and config file

The example above demonstrates how to use the `FeatureSelectionPipeline` directly by configuring parameters in the script. However, you can also utilize a configuration file to streamline the process. We provide a `config.yml` template that allows you to define your parameters in a structured format.

With a simple script, you can parse the configuration file and execute the ensemble feature selection pipeline, making it easy to define and adjust parameters without modifying the code directly. A template script, `main.py`, is included for this purpose.

You can run the pipeline by executing the following command in your terminal:

```bash
python main.py dataset.csv config.yml
```

Feel free to customize the script and configuration file as needed to suit your specific requirements!
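
For orientation, a script of this kind might look roughly like the sketch below. This is not the bundled `main.py`: it assumes that the keys in `config.yml` mirror the `FeatureSelectionPipeline` arguments shown earlier, that PyYAML is installed, and that the dataset needs no preprocessing. The helper names `parse_args`, `pipeline_kwargs`, and `main` are hypothetical.

```python
import argparse

# Assumption: config keys mirror the FeatureSelectionPipeline arguments used
# earlier in this notebook; the bundled template may use different names.
PIPELINE_KEYS = (
    "fs_methods", "merging_strategy", "num_repeats", "num_features_to_select",
    "metrics", "task", "random_state", "n_jobs",
)


def parse_args(argv=None):
    """Read the dataset and config paths from the command line."""
    parser = argparse.ArgumentParser(description="Run ensemble feature selection.")
    parser.add_argument("dataset", help="Path to the CSV dataset")
    parser.add_argument("config", help="Path to the YAML configuration file")
    return parser.parse_args(argv)


def pipeline_kwargs(config):
    """Validate the config and return the keyword arguments for the pipeline."""
    unknown = set(config) - set(PIPELINE_KEYS)
    if unknown:
        # Fail early so a typo in the config does not pass silently.
        raise ValueError(f"Unknown config keys: {sorted(unknown)}")
    return {key: config[key] for key in PIPELINE_KEYS if key in config}


def main(argv=None):
    # Imported here so the helpers above work without these packages installed.
    import pandas as pd
    import yaml  # provided by the PyYAML package
    from ensemblefs import FeatureSelectionPipeline

    args = parse_args(argv)
    with open(args.config) as handle:
        config = yaml.safe_load(handle)

    data = pd.read_csv(args.dataset)
    pipeline = FeatureSelectionPipeline(data=data, **pipeline_kwargs(config))
    return pipeline.run()


# Example with an in-memory config, shaped like what yaml.safe_load would return.
example_config = {
    "fs_methods": ["f_statistic_selector", "mutual_info_selector"],
    "merging_strategy": "union_of_intersections_merger",
    "num_repeats": 5,
}
print(pipeline_kwargs(example_config))
```

In a real script, `main()` would be called under an `if __name__ == "__main__":` guard so that the command above runs the pipeline end to end.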

## Extend the library 

You can easily extend the functionality of the FeatureSelectionPipeline by defining new merging strategies and feature selection classes. This flexibility allows you to tailor the pipeline to your specific needs and explore different approaches to feature selection. For detailed guidance on how to implement these customizations, please refer to the [documentation](https://arthurbabey.github.io/EnsembleFeatureSelection/) or the relevant tutorials provided.