# 03 - Data pre-processing and pipelines

- !! Explain and describe some common pre-processing steps for ML analyses in Cognitive Neuroscience, and why these are useful/needed.
- !! Explain that common pre-processing steps are implemented as so-called transformers in scikit-learn, and that transformers are also estimators.
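As a minimal sketch of the transformer/estimator relationship mentioned above, the snippet below checks which scikit-learn base classes two objects inherit from. The variable names are illustrative only.

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

scaler = StandardScaler()
svc = SVC()

# Both objects are estimators: they inherit from BaseEstimator and expose `fit`.
is_scaler_estimator = isinstance(scaler, BaseEstimator)
is_svc_estimator = isinstance(svc, BaseEstimator)

# Only the scaler is also a transformer: it additionally exposes `transform`.
scaler_is_transformer = isinstance(scaler, TransformerMixin)
svc_is_transformer = isinstance(svc, TransformerMixin)

print(f"StandardScaler: estimator={is_scaler_estimator}, transformer={scaler_is_transformer}")
print(f"SVC:            estimator={is_svc_estimator}, transformer={svc_is_transformer}")
```

In other words, every transformer can be fitted like any other estimator, but not every estimator can transform data.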

# Standard Scaler

- Explain what standardizing your data means.
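As a rough illustration of the idea, standardizing (z-scoring) a feature means subtracting its mean and dividing by its standard deviation. The sketch below does this by hand with NumPy on a small made-up array (`X_toy` is not part of the notebook's data).

```python
import numpy as np

# Toy data, made up for illustration: 4 samples, 2 features on very different scales.
X_toy = np.array([
    [1.0, 100.0],
    [2.0, 200.0],
    [3.0, 300.0],
    [4.0, 400.0],
])

# z-score each column: subtract the column mean, divide by the column standard deviation.
X_toy_std = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

print(X_toy_std.mean(axis=0))  # each column mean is (numerically) 0
print(X_toy_std.std(axis=0))   # each column standard deviation is 1
```

After standardization both columns are on the same scale, even though the raw values differed by two orders of magnitude.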

Let's standardize our data using scikit-learn. For this example we will use the wine dataset that ships with scikit-learn. Let's load it:

In [2]:
from sklearn.datasets import load_wine

dataset = load_wine(as_frame=True)
dataset['frame']

Unnamed: 0,alcohol,malic_acid,ash,alcalinity_of_ash,magnesium,total_phenols,flavanoids,nonflavanoid_phenols,proanthocyanins,color_intensity,hue,od280/od315_of_diluted_wines,proline,target
0,14.23,1.71,2.43,15.6,127.0,2.80,3.06,0.28,2.29,5.64,1.04,3.92,1065.0,0
1,13.20,1.78,2.14,11.2,100.0,2.65,2.76,0.26,1.28,4.38,1.05,3.40,1050.0,0
2,13.16,2.36,2.67,18.6,101.0,2.80,3.24,0.30,2.81,5.68,1.03,3.17,1185.0,0
3,14.37,1.95,2.50,16.8,113.0,3.85,3.49,0.24,2.18,7.80,0.86,3.45,1480.0,0
4,13.24,2.59,2.87,21.0,118.0,2.80,2.69,0.39,1.82,4.32,1.04,2.93,735.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
173,13.71,5.65,2.45,20.5,95.0,1.68,0.61,0.52,1.06,7.70,0.64,1.74,740.0,2
174,13.40,3.91,2.48,23.0,102.0,1.80,0.75,0.43,1.41,7.30,0.70,1.56,750.0,2
175,13.27,4.28,2.26,20.0,120.0,1.59,0.69,0.43,1.35,10.20,0.59,1.56,835.0,2
176,13.17,2.59,2.37,20.0,120.0,1.65,0.68,0.53,1.46,9.30,0.60,1.62,840.0,2


We can also read the description of the dataset that sklearn provides:

In [4]:
print(dataset['DESCR'])

.. _wine_dataset:

Wine recognition dataset
------------------------

**Data Set Characteristics:**

    :Number of Instances: 178 (50 in each of three classes)
    :Number of Attributes: 13 numeric, predictive attributes and the class
    :Attribute Information:
 		- Alcohol
 		- Malic acid
 		- Ash
		- Alcalinity of ash  
 		- Magnesium
		- Total phenols
 		- Flavanoids
 		- Nonflavanoid phenols
 		- Proanthocyanins
		- Color intensity
 		- Hue
 		- OD280/OD315 of diluted wines
 		- Proline

    - class:
            - class_0
            - class_1
            - class_2
		
    :Summary Statistics:
    
                                   Min   Max   Mean     SD
    Alcohol:                      11.0  14.8    13.0   0.8
    Malic Acid:                   0.74  5.80    2.34  1.12
    Ash:                          1.36  3.23    2.36  0.27
    Alcalinity of Ash:            10.6  30.0    19.5   3.3
    Magnesium:                    70.0 162.0    99.7  14.3
    Total Phenols:                0

The description of the dataset reveals that:
1. The dataset contains three classes as prediction targets, so this is a multi-class classification problem.
2. We are trying to predict the class of wine using 13 features.
3. If you take a look at the mean and standard deviation of these 13 features, you will notice their scale is different. 

We will thus scale each of the features so that they have a mean of 0 and a standard deviation of 1. In scikit-learn we can achieve this by calling the `StandardScaler` class (read the documentation [here](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html)).

In [5]:
from sklearn.preprocessing import StandardScaler

# Define X and y
X = dataset['data']
y = dataset['target']

# Create and fit scaler
scaler = StandardScaler().fit(X)

Notice that fitting the scaler only requires the input features (`X`), not the target (`y`). Calling `fit` alone does not transform the data, though: it only learns the mean and standard deviation of each feature. To apply the scaling we also need to call the `transform` method:

In [6]:
X_tr = scaler.transform(X)

Notice that we only transform the input `X`. Also, `transform` does not modify `X` in place: it returns a new array, which we store in a separate variable (here `X_tr`) so that the original data remain available. Let's check that the data were scaled by computing the mean and standard deviation of each feature, which should be 0 and 1, respectively:

In [7]:
import numpy as np

print(f"means: {np.round(np.mean(X_tr, axis=0), 2)}")
print(f"stds: {np.round(np.std(X_tr, axis=0), 2)}")

means: [-0. -0. -0. -0. -0.  0. -0.  0. -0.  0.  0.  0. -0.]
stds: [1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
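To make the split between `fit` and `transform` more concrete, here is a small sketch that reproduces the scaler's output by hand from its learned attributes `mean_` and `scale_`. The `_demo` variable names are illustrative and chosen so they do not overwrite the notebook's own variables.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler

X_demo = load_wine(as_frame=True)["data"]
scaler_demo = StandardScaler().fit(X_demo)

# `fit` stores one mean and one standard deviation per feature...
print(scaler_demo.mean_.shape, scaler_demo.scale_.shape)

# ...and `transform` applies them: (X - mean) / std, column by column.
X_manual = (X_demo.to_numpy() - scaler_demo.mean_) / scaler_demo.scale_
same_result = np.allclose(X_manual, scaler_demo.transform(X_demo))
print(f"Manual z-scoring matches scaler.transform: {same_result}")
```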


As we stated before, support vector machines can be affected by the scale of the data. Let's compare the performance of such a model with and without scaling the data:

In [9]:
from sklearn.svm import SVC

# Fit one classifier on the raw features and one on the standardized features
svc = SVC().fit(X, y)
svc_scaled_data = SVC().fit(X_tr, y)

print(f"SVC accuracy with non-scaled data: {np.round(svc.score(X, y), 2)}")
print(f"SVC accuracy with scaled data: {np.round(svc_scaled_data.score(X_tr, y), 2)}")

SVC accuracy with non-scaled data: 0.97
SVC accuracy with scaled data: 1.0




Indeed, the performance of the support vector classifier increased when we scaled our data.

### Exercise 

The fitting and transformation steps can be simplified using `fit_transform`. Read how to use it [here](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) and implement it.

#### Answer
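One possible answer, as a sketch: `fit_transform` fits the scaler and transforms the data in a single call. The variable names below (`X_ex`, `X_two_steps`, `X_one_step`) are illustrative.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler

X_ex = load_wine(as_frame=True)["data"]

# Two-step version: fit the scaler, then transform the data.
X_two_steps = StandardScaler().fit(X_ex).transform(X_ex)

# One-step version: fit and transform in a single call.
X_one_step = StandardScaler().fit_transform(X_ex)

print(f"Same result: {np.allclose(X_two_steps, X_one_step)}")
```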

# Dimensionality reduction

- !! Explain what dimensionality reduction is
- !! Explain why we do it (curse of dimensionality)
    - Mention this can lead to fitting our model to noise, but we will explain this concept in detail in the next notebook
    - With this approach we can remove noisy data that might be affecting the performance of our model

- There are two main types of dimensionality reduction: 
    - _Feature selection_: We select a subset of our features based on some method.
    - _Feature extraction_: We create new (and usually fewer) features based on the existing ones.

We will now see some examples of feature selection and feature extraction, and how to compute them using scikit-learn.

## Feature selection

In feature selection, we select a subset of the feature columns present in `X`. We choose this subset according to some selection method.

In scikit-learn we find many methods to perform feature selection, as can be seen [here](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.feature_selection). In this example, we will illustrate how `SelectKBest` works (read documentation [here](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html#sklearn.feature_selection.SelectKBest)).

`SelectKBest` selects the __k__ features that have the highest score when evaluated with a pre-defined scoring function. Scikit-learn provides many different scoring functions, and by default `SelectKBest` uses the ANOVA F-value between each feature and the target, computed by `f_classif` (see [here](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.f_classif.html#sklearn.feature_selection.f_classif)).

Let's select the 5 features with the highest F-value using this function:

In [12]:
from sklearn.feature_selection import SelectKBest

# Create and fit selector
selector = SelectKBest(k=5).fit(X_tr, y)

# Transform data
X_sel = selector.transform(X_tr)

Let's inspect the `selector` object after fitting:

In [23]:
vars(selector)
#vars(selector).keys()

{'score_func': <function sklearn.feature_selection._univariate_selection.f_classif(X, y)>,
 'k': 5,
 'n_features_in_': 13,
 'scores_': array([135.07762424,  36.94342496,  13.3129012 ,  35.77163741,
         12.42958434,  93.73300962, 233.92587268,  27.57541715,
         30.27138317, 120.66401844, 101.31679539, 189.97232058,
        207.9203739 ]),
 'pvalues_': array([3.31950380e-36, 4.12722880e-14, 4.14996797e-06, 9.44447294e-14,
        8.96339544e-06, 2.13767002e-28, 3.59858583e-50, 3.88804090e-11,
        5.12535874e-12, 1.16200802e-33, 5.91766222e-30, 1.39310496e-44,
        5.78316836e-47])}

In [14]:
selector.scores_

array([135.07762424,  36.94342496,  13.3129012 ,  35.77163741,
        12.42958434,  93.73300962, 233.92587268,  27.57541715,
        30.27138317, 120.66401844, 101.31679539, 189.97232058,
       207.9203739 ])

Let's check that `X_sel` now has fewer feature columns than `X`:

In [25]:
print(f"Number of columns in original data: {X.shape[1]}")
print(f"Number of columns in data after feature selection: {X_sel.shape[1]}")

Number of columns in original data: 13
Number of columns in data after feature selection: 5
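To see which features were kept, and not only how many, a fitted selector exposes `get_support()`, which returns a boolean mask over the input columns. The sketch below refits a selector on the wine data so that it is self-contained; the `_demo` names are illustrative.

```python
from sklearn.datasets import load_wine
from sklearn.feature_selection import SelectKBest
from sklearn.preprocessing import StandardScaler

wine = load_wine(as_frame=True)
X_demo, y_demo = wine["data"], wine["target"]

# Standardize, then keep the 5 features with the highest ANOVA F-value.
X_demo_tr = StandardScaler().fit_transform(X_demo)
selector_demo = SelectKBest(k=5).fit(X_demo_tr, y_demo)

# Boolean mask with one entry per input feature: True means the feature was kept.
mask = selector_demo.get_support()
selected_names = list(X_demo.columns[mask])
print(selected_names)
```

The selected names correspond to the five largest entries of `scores_` shown above.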


### Exercise

Use another score metric for selecting the k features with the best performance.

#### Answer
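One possible answer, as a sketch: pass a different scoring function through the `score_func` argument. Here we use `mutual_info_classif`; the choice of this function and the use of `functools.partial` to fix its `random_state` are illustrative assumptions, and any other scoring function for classification would work the same way.

```python
from functools import partial

from sklearn.datasets import load_wine
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.preprocessing import StandardScaler

wine = load_wine(as_frame=True)
X_ex, y_ex = wine["data"], wine["target"]
X_ex_tr = StandardScaler().fit_transform(X_ex)

# Mutual information is estimated with some randomness, so we fix the seed.
score_func = partial(mutual_info_classif, random_state=0)

# Keep the 5 features with the highest estimated mutual information with the target.
selector_mi = SelectKBest(score_func=score_func, k=5).fit(X_ex_tr, y_ex)
X_mi = selector_mi.transform(X_ex_tr)

print(X_mi.shape)
print(list(X_ex.columns[selector_mi.get_support()]))
```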

## Feature extraction

- We transform our features into a new set of features living in a lower-dimensional space
- There are linear/non-linear methods
- !! Explain PCA (link to notebook?) (link to documentation [here](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html#sklearn.decomposition.PCA))
 

In scikit-learn, dimensionality reduction methods are transformer objects. In this example, we will use `PCA` (read documentation [here](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html#sklearn.decomposition.PCA)) to perform dimensionality reduction through feature extraction.

!! Let's first create a classification problem:

In [1]:
from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=300, n_features=300, 
    n_informative=5, n_redundant=100, 
    random_state=0
)

Let's compute `PCA` over `X`: 

In [256]:
from sklearn.decomposition import PCA

# Fit PCA
pca = PCA(n_components=5, random_state=0).fit(X)

# Transform data
X_pca = pca.transform(X)

!! We don't have time to explain the inner outputs of PCA and their rationale, but you can read more about them in detail here (link to notebook)

In [257]:
X_pca.shape

(300, 5)
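As an optional aside, one quick way to gauge how much information the retained components carry is the fitted attribute `explained_variance_ratio_`. The sketch below rebuilds the same synthetic problem so that it runs on its own; the `_demo` names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA

X_demo, y_demo = make_classification(
    n_samples=300, n_features=300,
    n_informative=5, n_redundant=100,
    random_state=0
)

pca_demo = PCA(n_components=5, random_state=0).fit(X_demo)

# Fraction of the total variance captured by each component, in decreasing order.
ratios = pca_demo.explained_variance_ratio_
print(ratios.round(3))
print(f"Total variance explained by 5 components: {ratios.sum():.3f}")
```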

### Exercise

{TO-DO}

# Pipelines

- The convenience of pipelines
- How to implement a pipeline in sklearn
- Creating a pipeline is also a useful way to avoid leaking information when splitting the data into training/testing sets or doing cross-validation, but this will be explained in the next chapter

The documentation for `Pipeline` can be found [here](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html#sklearn.pipeline.Pipeline).

In [259]:
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('svc', SVC())
])

In [266]:
# Define X and y
X = dataset['data']
y = dataset['target']

# Fit all steps of the pipeline (scaler, then classifier)
pipe = pipe.fit(X, y)

# Score model
pipe.score(X, y)

1.0
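To illustrate what the pipeline does under the hood, the sketch below compares it with the equivalent manual sequence of steps: scale the data, then fit the classifier on the scaled data. The `_demo` and `_manual` names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

wine = load_wine(as_frame=True)
X_demo, y_demo = wine["data"], wine["target"]

# Pipeline version: scaling and classification chained in a single estimator.
pipe_demo = Pipeline([
    ("scaler", StandardScaler()),
    ("svc", SVC()),
]).fit(X_demo, y_demo)

# Manual version: the same two steps, written out explicitly.
X_scaled = StandardScaler().fit_transform(X_demo)
svc_manual = SVC().fit(X_scaled, y_demo)

same_predictions = np.array_equal(
    pipe_demo.predict(X_demo), svc_manual.predict(X_scaled)
)
print(f"Pipeline and manual steps agree: {same_predictions}")

# Individual fitted steps remain accessible by name.
print(pipe_demo.named_steps["scaler"].mean_.round(2))
```

The pipeline saves us from carrying the intermediate `X_scaled` around and from remembering to apply the same scaling before every call to `predict` or `score`.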

We can also create a pipeline more quickly with `make_pipeline` (read documentation [here](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.make_pipeline.html#sklearn.pipeline.make_pipeline)), which names the steps automatically:

In [18]:
from sklearn.pipeline import make_pipeline

pipe = make_pipeline(StandardScaler(), SVC())
pipe

Pipeline(steps=[('standardscaler', StandardScaler()), ('svc', SVC())])

In [19]:
pipe.fit(X, y).score(X, y)

1.0

### Exercise
 Create a pipeline with different estimators (define which)
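As one possible starting point, the sketch below chains three steps: scaling, `PCA`, and a logistic regression classifier. The choice of estimators, `n_components=5`, and `max_iter=1000` are illustrative assumptions, not prescribed by the exercise.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

wine = load_wine(as_frame=True)
X_ex, y_ex = wine["data"], wine["target"]

# Scale -> reduce to 5 principal components -> classify.
pipe_ex = make_pipeline(
    StandardScaler(),
    PCA(n_components=5),
    LogisticRegression(max_iter=1000),
)

# Accuracy on the same data used for fitting (not an estimate of generalization).
train_accuracy = pipe_ex.fit(X_ex, y_ex).score(X_ex, y_ex)

print(list(pipe_ex.named_steps))
print(f"Training accuracy: {train_accuracy:.2f}")
```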

# Check your knowledge

Load the ABIDE 2 dataset and create two pipelines:
1. _Pipeline 1_: Standardize your data, perform `PCA` selecting $k$ components, and perform classification analysis using `SVC`.
2. _Pipeline 2_: Standardize your data, select the $k$ best features using the F-score, and perform classification analysis using `SVC`.

Answer the following questions:
1. Which pipeline achieves the best performance?
2. Vary $k$. How does the performance change?
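The skeleton below sketches how the two pipelines could be set up and compared for several values of $k$. It uses a synthetic dataset from `make_classification` as a stand-in, since loading ABIDE 2 is not shown here; replace `X_syn` and `y_syn` with the real features and labels. The values of $k$ are arbitrary, and the scores are computed on the training data only.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in (assumption): swap in the ABIDE 2 features and labels here.
X_syn, y_syn = make_classification(
    n_samples=200, n_features=50, n_informative=5, random_state=0
)

scores = {}
for k in (2, 5, 10):
    # Pipeline 1: standardize -> PCA with k components -> SVC
    pipe_pca = make_pipeline(
        StandardScaler(), PCA(n_components=k, random_state=0), SVC()
    )
    # Pipeline 2: standardize -> k best features by ANOVA F-score -> SVC
    pipe_kbest = make_pipeline(
        StandardScaler(), SelectKBest(score_func=f_classif, k=k), SVC()
    )
    scores[k] = (
        pipe_pca.fit(X_syn, y_syn).score(X_syn, y_syn),
        pipe_kbest.fit(X_syn, y_syn).score(X_syn, y_syn),
    )

for k, (acc_pca, acc_kbest) in scores.items():
    print(f"k={k:>2}: PCA pipeline={acc_pca:.2f}, SelectKBest pipeline={acc_kbest:.2f}")
```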


## Additional resources