# Data pre-processing

- Common pre-processing steps for ML analyses in Cognitive Neuroscience
- Give examples:
    - How to do feature selection with F-Score
    - How to do dimensionality reduction with PCA

## Standard Scaler

- Explain Standard Scaler
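
As a minimal sketch of what standardization does (the `*_demo` and `X_manual` names are illustrative, not part of the original notebook): each feature is centred on its mean and divided by its standard deviation, z = (x - mean) / std. `StandardScaler` uses the population standard deviation (`ddof=0`), so a manual computation with NumPy's defaults should agree with it.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler

# Load the wine features as a plain NumPy array
X_demo = load_wine().data

# Manual z-scoring: subtract each column's mean, divide by its standard deviation
# (np.std defaults to ddof=0, the same convention StandardScaler uses)
X_manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

# The same operation with scikit-learn
scaler_demo = StandardScaler().fit(X_demo)
X_sklearn = scaler_demo.transform(X_demo)

# True if both approaches give the same values
print(np.allclose(X_manual, X_sklearn))
```

The scaler is useful beyond this one-liner because it stores the means and standard deviations learned in `fit`, so the same transformation can later be applied to new data.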

In [262]:
from sklearn.datasets import load_wine

dataset = load_wine(as_frame=True)
dataset['frame']

index,alcohol,malic_acid,ash,alcalinity_of_ash,magnesium,total_phenols,flavanoids,nonflavanoid_phenols,proanthocyanins,color_intensity,hue,od280/od315_of_diluted_wines,proline,target
0,14.23,1.71,2.43,15.6,127.0,2.80,3.06,0.28,2.29,5.64,1.04,3.92,1065.0,0
1,13.20,1.78,2.14,11.2,100.0,2.65,2.76,0.26,1.28,4.38,1.05,3.40,1050.0,0
2,13.16,2.36,2.67,18.6,101.0,2.80,3.24,0.30,2.81,5.68,1.03,3.17,1185.0,0
3,14.37,1.95,2.50,16.8,113.0,3.85,3.49,0.24,2.18,7.80,0.86,3.45,1480.0,0
4,13.24,2.59,2.87,21.0,118.0,2.80,2.69,0.39,1.82,4.32,1.04,2.93,735.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
173,13.71,5.65,2.45,20.5,95.0,1.68,0.61,0.52,1.06,7.70,0.64,1.74,740.0,2
174,13.40,3.91,2.48,23.0,102.0,1.80,0.75,0.43,1.41,7.30,0.70,1.56,750.0,2
175,13.27,4.28,2.26,20.0,120.0,1.59,0.69,0.43,1.35,10.20,0.59,1.56,835.0,2
176,13.17,2.59,2.37,20.0,120.0,1.65,0.68,0.53,1.46,9.30,0.60,1.62,840.0,2


In [17]:
print(dataset['DESCR'])

.. _wine_dataset:

Wine recognition dataset
------------------------

**Data Set Characteristics:**

    :Number of Instances: 178 (50 in each of three classes)
    :Number of Attributes: 13 numeric, predictive attributes and the class
    :Attribute Information:
 		- Alcohol
 		- Malic acid
 		- Ash
		- Alcalinity of ash  
 		- Magnesium
		- Total phenols
 		- Flavanoids
 		- Nonflavanoid phenols
 		- Proanthocyanins
		- Color intensity
 		- Hue
 		- OD280/OD315 of diluted wines
 		- Proline

    - class:
            - class_0
            - class_1
            - class_2
		
    :Summary Statistics:
    
                                   Min   Max   Mean     SD
    Alcohol:                      11.0  14.8    13.0   0.8
    Malic Acid:                   0.74  5.80    2.34  1.12
    Ash:                          1.36  3.23    2.36  0.27
    Alcalinity of Ash:            10.6  30.0    19.5   3.3
    Magnesium:                    70.0 162.0    99.7  14.3
    Total Phenols:                0

This is a multi-class classification problem: each wine belongs to one of three classes (`class_0`, `class_1`, `class_2`).

Read documentation of Standard Scaler [here](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html).

In [263]:
from sklearn.preprocessing import StandardScaler

# Define X and y
X = dataset['data']
y = dataset['target']

# Create and fit scaler
scaler = StandardScaler()
scaler = scaler.fit(X)

In [91]:
import numpy as np

np.mean(X, axis=0)

alcohol                          13.000618
malic_acid                        2.336348
ash                               2.366517
alcalinity_of_ash                19.494944
magnesium                        99.741573
total_phenols                     2.295112
flavanoids                        2.029270
nonflavanoid_phenols              0.361854
proanthocyanins                   1.590899
color_intensity                   5.058090
hue                               0.957449
od280/od315_of_diluted_wines      2.611685
proline                         746.893258
dtype: float64

In [92]:
X_tr = scaler.transform(X)

print(f"means: {np.round(np.mean(X_tr, axis=0), 2)}")
print(f"stds: {np.round(np.std(X_tr, axis=0), 2)}")


means: [-0. -0. -0. -0. -0.  0. -0.  0. -0.  0.  0.  0. -0.]
stds: [1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]


In [94]:
from sklearn.svm import SVC

svc = SVC().fit(X, y)
svc_scaled_data = SVC().fit(X_tr, y)

print(svc.score(X, y))
print(svc_scaled_data.score(X_tr, y))

0.7078651685393258
1.0


Important: give the transformed data a different name (here `X_tr`) so that the original `X` is not overwritten.

### Exercise 

The steps can be simplified using `fit_transform`. Read how to use it [here](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) and implement it.

## Feature selection

- Explain feature selection and why it is necessary
- Explain Select K Best (read documentation [here](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html)) and that we will use the ANOVA F-value for this example
- Feature selection with `SelectKBest` is supervised, so we need to pass `y` as well as `X` when fitting
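
A rough sketch of how the scoring works, assuming the default `score_func=f_classif` (the ANOVA F-value): every feature gets one F-score computed against `y`, and `SelectKBest` keeps the `k` features with the highest scores. The `*_demo` and `top5_*` names are illustrative only.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.feature_selection import SelectKBest, f_classif

X_demo, y_demo = load_wine(return_X_y=True)

# One ANOVA F-value (and p-value) per feature, computed against the labels
f_scores, p_values = f_classif(X_demo, y_demo)

# SelectKBest computes the same scores internally and keeps the k largest
selector_demo = SelectKBest(score_func=f_classif, k=5).fit(X_demo, y_demo)

# Indices of the 5 highest F-scores, computed by hand
top5_by_hand = np.sort(np.argsort(f_scores)[-5:])
# Indices actually kept by the selector
top5_selected = selector_demo.get_support(indices=True)

print(top5_by_hand)
print(top5_selected)
```

Both printed index arrays should be identical, which is a quick way to check that the selection is nothing more than a ranking of the per-feature scores.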

In [95]:
from sklearn.feature_selection import SelectKBest

selector = SelectKBest(k=5)
selector = selector.fit(X_tr, y)
X_sel = selector.transform(X_tr)

In [79]:
vars(selector)

{'score_func': <function sklearn.feature_selection._univariate_selection.f_classif(X, y)>,
 'k': 5,
 'n_features_in_': 13,
 'scores_': array([135.07762424,  36.94342496,  13.3129012 ,  35.77163741,
         12.42958434,  93.73300962, 233.92587268,  27.57541715,
         30.27138317, 120.66401844, 101.31679539, 189.97232058,
        207.9203739 ]),
 'pvalues_': array([3.31950380e-36, 4.12722880e-14, 4.14996797e-06, 9.44447294e-14,
        8.96339544e-06, 2.13767002e-28, 3.59858583e-50, 3.88804090e-11,
        5.12535874e-12, 1.16200802e-33, 5.91766222e-30, 1.39310496e-44,
        5.78316836e-47])}

In [73]:
selector.scores_

array([135.07762424,  36.94342496,  13.3129012 ,  35.77163741,
        12.42958434,  93.73300962, 233.92587268,  27.57541715,
        30.27138317, 120.66401844, 101.31679539, 189.97232058,
       207.9203739 ])

In [84]:
print(X_tr.shape)
print(X_sel.shape)

(178, 13)
(178, 5)
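
The shapes confirm that 5 columns were kept, but not which ones. One possible way to recover their names is `get_support()`, which returns a boolean mask over the input columns. The sketch below refits a selector on the unscaled wine data for brevity (the F-value does not change when a feature is rescaled); the `*_demo` names are illustrative.

```python
from sklearn.datasets import load_wine
from sklearn.feature_selection import SelectKBest

wine_demo = load_wine(as_frame=True)
X_demo, y_demo = wine_demo['data'], wine_demo['target']

# Default score function is f_classif (ANOVA F-value)
selector_demo = SelectKBest(k=5).fit(X_demo, y_demo)

# Boolean mask with one entry per input column: True means "kept"
mask = selector_demo.get_support()

# Use the mask to look up the names of the retained features
selected_names = list(X_demo.columns[mask])
print(selected_names)
```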


### Exercises
- Use a different score function to select the k best features
- Compare the performance of the model with and without feature selection

## Dimensionality reduction

- Explain dimensionality reduction and why you would want to implement it
- Explain PCA (link to my notebook?) (link to documentation [here](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html#sklearn.decomposition.PCA))
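
To give a feel for what PCA returns, here is a small sketch on synthetic data generated with the same settings used in this section. It inspects `explained_variance_ratio_`, the fraction of the total variance captured by each retained component. The `*_demo` and `ratios` names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA

# Synthetic data: 300 samples, 300 features, only a few of them informative
X_demo, _ = make_classification(
    n_samples=300, n_features=300,
    n_informative=5,
    n_redundant=100, random_state=0
)

# Keep the 5 directions of largest variance
pca_demo = PCA(n_components=5, random_state=0).fit(X_demo)

# Fraction of total variance explained by each component (largest first)
ratios = pca_demo.explained_variance_ratio_
print(np.round(ratios, 3))

# Fraction of total variance retained by the 5 components together
print(round(float(ratios.sum()), 3))
```

A low cumulative value would suggest that 5 components discard much of the variance in the data, which is worth keeping in mind when comparing model scores before and after the reduction.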
 

In [255]:
from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=300, n_features=300, 
    n_informative=5, 
    n_redundant=100, random_state=0
)


- PCA is unsupervised, so we only need `X` (not `y`)

In [256]:
from sklearn.decomposition import PCA

# Fit PCA
pca = PCA(n_components=5, random_state=0).fit(X)

# Transform data
X_pca = pca.transform(X)

We don't have time to cover the fitted attributes of PCA and their rationale here, but you can read about them in detail here (link to my notebook).

In [257]:
X_pca.shape

(300, 5)

In [258]:
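from sklearn.linear_model import LinearRegression, LogisticRegression

# Classifier fitted on all 300 original features
clf_full = LogisticRegression().fit(X, y)
# Linear regression fitted on the 5 principal components
# NOTE: LinearRegression.score returns R^2, not accuracy, so the two
# printed numbers are not directly comparable; fit LogisticRegression
# on X_pca instead for a like-for-like accuracy comparison
reg_pca = LinearRegression().fit(X_pca, y)

print(clf_full.score(X, y))
print(reg_pca.score(X_pca, y))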

1.0
0.48828248184039524
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
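
For a like-for-like comparison, one option is to fit the same classifier on the full and on the PCA-reduced data, so that both scores are classification accuracies. This is a minimal sketch: scaling the data first and setting `max_iter=1000` are assumptions made here to help the solver converge, and all names are illustrative. The scores are computed on the training data, so no expected values are shown.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_classification(
    n_samples=300, n_features=300,
    n_informative=5,
    n_redundant=100, random_state=0
)

# Scale first, as the convergence warning suggests
X_scaled = StandardScaler().fit(X_demo).transform(X_demo)

# Reduce the scaled data to 5 principal components
X_reduced = PCA(n_components=5, random_state=0).fit(X_scaled).transform(X_scaled)

# Same classifier on both representations
clf_full = LogisticRegression(max_iter=1000).fit(X_scaled, y_demo)
clf_pca = LogisticRegression(max_iter=1000).fit(X_reduced, y_demo)

# Both values are training accuracies
acc_full = clf_full.score(X_scaled, y_demo)
acc_pca = clf_pca.score(X_reduced, y_demo)
print(acc_full)
print(acc_pca)
```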


## Pipelines

- How to implement a pipeline in sklearn
- A pipeline is also a useful tool for avoiding information leakage when splitting the data into training/testing sets or doing cross-validation, but this will be explained in the next chapter

Documentation of pipeline can be found [here](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html#sklearn.pipeline.Pipeline)

In [259]:
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('svc', SVC())
])


In [266]:
# Define X and y
X = dataset['data']
y = dataset['target']

# Fit every step of the pipeline (scaler, then SVC)
pipe = pipe.fit(X, y)

# Score model
pipe.score(X, y)

1.0
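
Once a pipeline is fitted, its individual steps remain accessible. A short sketch, assuming the same two steps as above (the `*_demo` and `fitted_scaler` names are illustrative): `named_steps` retrieves a step by the name given when building the pipeline, and slicing the pipeline returns a sub-pipeline containing only the selected steps.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_demo, y_demo = load_wine(return_X_y=True)

pipe_demo = Pipeline([
    ('scaler', StandardScaler()),
    ('svc', SVC())
]).fit(X_demo, y_demo)

# Each fitted step can be retrieved by its name
fitted_scaler = pipe_demo.named_steps['scaler']
print(np.round(fitted_scaler.mean_[:3], 2))

# Slicing keeps only the pre-processing part (everything but the last step)
X_scaled = pipe_demo[:-1].transform(X_demo)
print(X_scaled.shape)
```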

Documentation of `make_pipeline` can be found [here](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.make_pipeline.html#sklearn.pipeline.make_pipeline)

In [267]:
from sklearn.pipeline import make_pipeline

pipe = make_pipeline(StandardScaler(), SVC())
pipe

Pipeline(steps=[('standardscaler', StandardScaler()), ('svc', SVC())])

In [268]:
pipe.fit(X, y).score(X, y)

1.0

### Exercises
- Create a pipeline with different estimators (define which)