# Preprocessing


## 3.1 Scaling

- StandardScaler: Ensures each feature has mean=0 and variance=1, bringing all features to the same magnitude. Does not ensure any particular min and max values for the features.
- RobustScaler: Similar to StandardScaler, but it uses the median and quartiles instead of the mean and variance, so it is far less affected by points that are very different from the others (outliers).
- MinMaxScaler: Shifts and rescales each feature so that all its values in the data the scaler was fit on lie between 0 and 1.
- Normalizer: Scales each data point so that the feature vector has a Euclidean length of 1. Often used when only the direction (or angle) of the data matters, not the length of the feature vector.
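
To make the differences between the four scalers concrete, here is a minimal sketch that applies each of them to a small made-up array. The array `X_toy` and its values are assumptions chosen only for illustration (the last row acts as an outlier in the first feature); the scaler classes are the ones listed above.

```python
import numpy as np
from sklearn.preprocessing import (
    MinMaxScaler,
    Normalizer,
    RobustScaler,
    StandardScaler,
)

# Toy data (hypothetical): 5 samples, 2 features.
# The last row is an outlier in the first feature.
X_toy = np.array([
    [1.0, 10.0],
    [2.0, 20.0],
    [3.0, 30.0],
    [4.0, 40.0],
    [100.0, 50.0],
])

X_std = StandardScaler().fit_transform(X_toy)     # per feature: mean 0, variance 1
X_robust = RobustScaler().fit_transform(X_toy)    # per feature: median 0, scaled by the interquartile range
X_minmax = MinMaxScaler().fit_transform(X_toy)    # per feature: min 0, max 1
X_norm = Normalizer().fit_transform(X_toy)        # per sample (row): Euclidean length 1

print("StandardScaler mean/std per feature:", X_std.mean(axis=0), X_std.std(axis=0))
print("RobustScaler median per feature:", np.median(X_robust, axis=0))
print("MinMaxScaler min/max per feature:", X_minmax.min(axis=0), X_minmax.max(axis=0))
print("Normalizer row lengths:", np.linalg.norm(X_norm, axis=1))
```

Note that the first three scalers work column by column (per feature), while Normalizer works row by row (per data point).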


**YOU NEED TO APPLY EXACTLY THE SAME TRANSFORMATION TO THE TRAINING AND TEST SET: FIT THE SCALER ON THE TRAINING SET ONLY, THEN USE IT TO TRANSFORM BOTH**


Example on how to do it:

````
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# calling fit and transform in sequence (using method chaining)
X_scaled = scaler.fit(X).transform(X)
# same result, but more efficient computation
X_scaled_d = scaler.fit_transform(X)

````




In [None]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

cancer = load_breast_cancer()

X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, random_state=1)

print(f"Shape of X_train, before scaling {X_train.shape}")
print(f"Shape of X_test, before scaling {X_test.shape}")

print(f"Max and Min values of X_train, before scaling \n MAX {X_train.max(axis=0)}, MIN {X_train.min(axis=0)}")

# We need to import the class that implements the preprocessing and then instantiate it
scaler = MinMaxScaler()
scaler.fit(X_train) # you do this 'fit' only once, with the training set

# then we proceed with the transformations
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# We could also do the fit and transform in one go
# X_train_scaled = scaler.fit_transform(X_train)
# or, less efficient, but also works the same
# X_train_scaled = scaler.fit(X_train).transform(X_train)


print(f"Max and Min values of X_train, after scaling \n MAX {X_train_scaled.max(axis=0)}, MIN {X_train_scaled.min(axis=0)}")



(426, 30)
(143, 30)
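
To see why the scaler is fit only once, on the training set, the sketch below compares the test set transformed with the training-fitted scaler against the test set transformed with a second scaler fit on the test set itself. The second scaler and the name `X_test_refit` are hypothetical and exist only to show the mistake; everything else follows the cell above.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=1
)

# Correct: fit on the training set only, then reuse the same scaler
scaler = MinMaxScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Incorrect (for comparison only): a separate scaler fit on the test set
X_test_refit = MinMaxScaler().fit_transform(X_test)

print("train min/max:", X_train_scaled.min(), X_train_scaled.max())
print("test min/max (scaler fit on train):", X_test_scaled.min(), X_test_scaled.max())
print("test min/max (scaler refit on test):", X_test_refit.min(), X_test_refit.max())
print("same transformation?", np.allclose(X_test_scaled, X_test_refit))
```

With the training-fitted scaler, the test set is not guaranteed to land exactly in the range 0 to 1, and that is fine: what matters is that both sets were shifted and rescaled in exactly the same way. Refitting on the test set forces it into 0 to 1 but moves the test points relative to the training points.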


In [None]:
#EXAMPLE ON HOW TO APPLY IN THE CANCER DS
from sklearn.svm import SVC

X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, random_state=0)

svm = SVC()

#pre-processing
scaler = MinMaxScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

#learning an SVM on the scaled training data
svm.fit(X_train_scaled, y_train)


#scoring on the scaled test set
print(f"Scaled test accuracy: {svm.score(X_test_scaled, y_test):.2f}")
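
To judge whether the scaling step helped, one option is to train the same model on the raw features and compare test accuracies. The following is a minimal sketch under that assumption; the names `svm_raw`, `svm_scaled`, `unscaled_acc` and `scaled_acc` are introduced here for illustration, and `SVC()` is used with its default parameters as in the cell above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=0
)

# Baseline: SVC trained and scored on the raw (unscaled) features
svm_raw = SVC().fit(X_train, y_train)
unscaled_acc = svm_raw.score(X_test, y_test)

# Same model, but with MinMaxScaler fit on the training set only
scaler = MinMaxScaler().fit(X_train)
svm_scaled = SVC().fit(scaler.transform(X_train), y_train)
scaled_acc = svm_scaled.score(scaler.transform(X_test), y_test)

print(f"Unscaled test accuracy: {unscaled_acc:.2f}")
print(f"Scaled test accuracy: {scaled_acc:.2f}")
```

Kernel SVMs are sensitive to feature magnitudes, so scaling often improves the score, but the size of the effect depends on the data and on the model's parameters.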

