In [1]:
%matplotlib inline


# Meta-Learner


First, check that your project directory looks like this:

```text
lib
├── data                        # Store data
│   ├─ dataset                  # Store datasets
│   ├─ metafeatures             # Store metafeatures
│   └─ model                    # Store trained ML models
├── images                      # Images for presentations, README, etc.
├── other                       # Script or Notebook related to the thesis or to the plots
├── src                         # Actual code
│   ├─ test                     # Test code
│   ├─ utils                    # General utility code
│   ├─ exceptions.py            # To handle custom exceptions
│   └─ config.py                # Common knowledge for the project
├── main.py
├── requirements.txt
├── setup.py
├── test.py                     # To test all
└── Tutorial.ipynb              # Simple notebook tutorial
```
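
The layout above can also be checked programmatically. The following is a minimal sketch, not part of the project: `EXPECTED_ENTRIES` and `missing_entries` are hypothetical names, and the list only covers a few entries copied from the tree.

```python
from pathlib import Path

# A few entries copied from the tree above (hypothetical list, extend as needed).
EXPECTED_ENTRIES = [
    "data/dataset",
    "data/metafeatures",
    "data/model",
    "src/utils",
    "src/config.py",
    "main.py",
]


def missing_entries(root):
    """Return the expected entries that do not exist under `root`."""
    root = Path(root)
    return [entry for entry in EXPECTED_ENTRIES if not (root / entry).exists()]


# "." assumes the notebook is started from the project root (`lib`).
print(missing_entries("."))
```

An empty list means every checked entry was found.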

## Data
To train the meta-learner we first need the data.

In [None]:
from os.path import join
from src.config import DATASET_FOLDER 
from src.utils.data_preparation import data_preparation 

# Directory where you have stored your CSV datasets.
prova = join(DATASET_FOLDER, 'prova')

data_preparation(
    data_path=prova,
    data_selection=False,          # datasets are already on disk, skip the download
    data_preprocess=True,          # preprocess all datasets
    metafeatures_extraction=True,  # extract metafeatures
    model_training=True,           # train all models
    quotient=True)

The `data_preparation` function can perform several steps, each controlled by a flag:
* **Data download**: downloads the datasets. If you already have the datasets, you can disable it by setting `data_selection = False`.
* **Data preprocessing**: applies all preprocessing methods to all datasets. If you have already done this, you can disable it by setting `data_preprocess = False`.
* **Metafeatures extraction**: extracts metafeatures from all datasets, both raw and preprocessed. If you have already done this, you can disable it by setting `metafeatures_extraction = False`.
* **Model training**: trains all models on all datasets, both raw and preprocessed. If you have already done this, you can disable it by setting `model_training = False`.

* **Quotient**: controls how the delta is computed. If it is `False`, the difference between the metrics is computed as a quotient; otherwise, as a subtraction.
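
To make the two kinds of delta concrete, here is a minimal sketch. `performance_delta` and its `mode` argument are hypothetical and not part of `src`. The direction (after versus before) and the assumption that a higher metric is better are also mine. How the `quotient` flag selects between the two modes is decided inside `data_preparation`.

```python
def performance_delta(score_before, score_after, mode="subtraction"):
    """Compare a metric measured before and after preprocessing.

    Hypothetical helper: assumes a higher score is better.
    """
    if mode == "quotient":
        # Relative change: a value above 1 means preprocessing helped.
        # Raises ZeroDivisionError when score_before is 0.
        return score_after / score_before
    if mode == "subtraction":
        # Absolute change: a value above 0 means preprocessing helped.
        return score_after - score_before
    raise ValueError(f"unknown mode: {mode!r}")


print(performance_delta(0.5, 0.75, mode="quotient"))
print(performance_delta(0.5, 0.75, mode="subtraction"))
```

A quotient is scale-free, which helps when baseline scores differ a lot between datasets, while a subtraction stays defined when the baseline score is zero.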

## Train

If you want to train the meta-learner on the delta of the performances:

In [None]:
from os.path import join
from src.utils.metalearner import train_metalearner
from src.config import METAFEATURES_FOLDER


delta_path = join(METAFEATURES_FOLDER, "delta.csv")

train_metalearner(
    metafeatures_path = delta_path,
    algorithm='random_forest')

If you want to train on the raw data and then compute the difference (delta) afterwards:

In [None]:
from os.path import join
from src.utils.data_preparation import choose_performance_from_metafeatures
from src.utils.metalearner import train_metalearner
from src.config import METAFEATURES_FOLDER

metafeatures_path = join(METAFEATURES_FOLDER, "metafeatures.csv")

choose_performance_from_metafeatures(
    metafeatures_path = metafeatures_path,
    metric='f1_score',
    copy_name='new_metafeatures.csv')

new_metafeatures_path = join(METAFEATURES_FOLDER, "new_metafeatures.csv")

train_metalearner(
    metafeatures_path = new_metafeatures_path,
    algorithm='random_forest')

To check whether it is better to use the delta metafeatures or the raw metafeatures, use `delta_or_metafeatures`.

In [None]:
from src.utils.data_preparation import delta_or_metafeatures

delta_path = join(METAFEATURES_FOLDER, "delta.csv")
metafeatures_path = join(METAFEATURES_FOLDER, "metafeatures.csv")
delta_or_metafeatures(delta_path=delta_path, metafeatures_path=metafeatures_path)

## Prediction

To estimate the improvement rate of a dataset after preprocessing, use the function `predicted_improvement`.

The estimate takes into account the machine learning model considered. 

In [None]:
from src.utils.preprocessing_improvement import predicted_improvement
from src.config import METAFEATURES_MODEL_FOLDER

some_dataset = join(DATASET_FOLDER, 'Test', 'wine-quality-white.csv')

predicted_improvement(
    dataset_path= some_dataset,
    preprocessing = 'pca',
    algorithm = 'svm',
    metalearner_path = join(METAFEATURES_MODEL_FOLDER, 'metalearner_gaussian_process.joblib')
)

Most of the prediction time is spent preprocessing the dataset.

If the preprocessing has already been carried out, pass the path of the preprocessed dataset through the `preprocessing_path` parameter.

In [None]:
from src.config import TEST_FOLDER
test_dataset_path = join(TEST_FOLDER, 'data')

predicted_improvement(
    dataset_path= some_dataset,
    preprocessing_path = join(test_dataset_path, 'pca', 'kc1.csv'),
    algorithm = 'svm',
    metalearner_path = join(METAFEATURES_MODEL_FOLDER, 'metalearner_random_forest.joblib')
)

## Brute force

If you want to search for the best preprocessing by brute force, use the function `one_step_bruteforce`.

This function returns a dictionary with the preprocessing used and the estimated delta.

If you only want the best preprocessing, without the full list of estimates, use `best_one_step_bruteforce`.

In [None]:
from src.utils.preprocessing_improvement import one_step_bruteforce

results = one_step_bruteforce(
    dataset_path= some_dataset,
    algorithm = 'svm',
    metalearner_path = join(METAFEATURES_MODEL_FOLDER, 'metalearner_random_forest.joblib')
)

print(results)

In [None]:
from src.utils.preprocessing_improvement import best_one_step_bruteforce

best = best_one_step_bruteforce(
    dataset_path= some_dataset,
    algorithm = 'svm',
    metalearner_path = join(METAFEATURES_MODEL_FOLDER, 'metalearner_random_forest.joblib')
)

print(best)
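
For intuition, picking the best entry from a dictionary of estimates can be sketched in plain Python. This assumes a flat mapping from preprocessing name to a numeric delta, which may not match the exact structure returned by `one_step_bruteforce`. `best_entry` is a hypothetical helper and the numbers are made up for illustration.

```python
def best_entry(estimates):
    """Return the (name, delta) pair with the largest estimated delta."""
    if not estimates:
        raise ValueError("no estimates to compare")
    return max(estimates.items(), key=lambda item: item[1])


# Made-up values, for illustration only.
example_estimates = {
    "pca": 0.02,
    "standard_scaler": 0.05,
    "feature_agglomeration": -0.01,
}

name, delta = best_entry(example_estimates)
print(f"{name}: {delta}")
```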

## Example of use

Suppose I want to use this estimator to choose a reasonable preprocessing pipeline.

I want to consider only `PCA`, `Standard Scaler` (SS) and `Feature Agglomeration` (FA) as methods.

When the Standard Scaler is used it must run first, and every step is optional.

In all, I have 9 possible pipelines to test:

1) SS 
2) SS -> PCA
3) SS -> PCA -> FA
4) SS -> FA
5) SS -> FA -> PCA
6) PCA
7) PCA -> FA
8) FA
9) FA -> PCA
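
The nine pipelines listed above can also be enumerated rather than written by hand. This is a minimal sketch using only the standard library. `candidate_pipelines` is a hypothetical helper, and the method names are the strings used in the code cells of this notebook.

```python
from itertools import permutations

METHODS = ["standard_scaler", "pca", "feature_agglomeration"]


def candidate_pipelines(methods=METHODS, first="standard_scaler"):
    """List every non-empty ordering of distinct methods.

    Orderings that contain `first` anywhere but in first position are dropped.
    """
    pipelines = []
    for length in range(1, len(methods) + 1):
        for steps in permutations(methods, length):
            if first in steps and steps[0] != first:
                continue
            pipelines.append(list(steps))
    return pipelines


pipelines = candidate_pipelines()
print(len(pipelines))
for steps in pipelines:
    print(" -> ".join(steps))
```

The pipelines come out grouped by length, so their order differs from the numbered list, but the set is the same.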


In [None]:
from src.utils.preprocessing_improvement import pipeline_experiments, max_in_dict

list_of_experiments = [
    ['standard_scaler'],
    ['standard_scaler', 'pca'],
    ['standard_scaler', 'pca', 'feature_agglomeration'],
    ['standard_scaler', 'feature_agglomeration'],
    ['standard_scaler', 'feature_agglomeration', 'pca'],
    ['pca'],
    ['pca', 'feature_agglomeration'],
    ['feature_agglomeration'],
    ['feature_agglomeration', 'pca'],
]

experiments = pipeline_experiments(
    dataset_path = some_dataset,
    algorithm = 'svm',
    list_of_experiments = list_of_experiments,
    metalearner_path = join(METAFEATURES_MODEL_FOLDER, 'metalearner_random_forest.joblib')
    )

print(experiments)

# Pick the best pipeline from this cell's experiments (not the earlier brute-force `results`).
key, value = max_in_dict(experiments)
print(f"The best experiment is {key}, with {value} estimated improvement.")

To run a single experiment, you can use either `pipeline_experiments` or `preprocessing_experiment`.

In [None]:
from src.utils.metafeatures_extraction import metafeature
from src.utils.preprocessing_improvement import preprocessing_experiment

data_metafeatures = metafeature(some_dataset)
result = preprocessing_experiment(
    dataset_path = some_dataset,
    algorithm = 'svm',
    experiment = list_of_experiments[2],
    data_metafeatures = data_metafeatures,
    metalearner_path = join(METAFEATURES_MODEL_FOLDER, 'metalearner_random_forest.joblib')
)

print(result)