# Centering and scaling
Data imputation is one of several important preprocessing steps for machine learning. In this notebook, we will cover another: centering and scaling your data.

## Why scale your data?
To motivate this, let's use `df.describe()` to check out the ranges of the feature variables in the red wine quality dataset.

In [7]:
import pandas as pd
import warnings
warnings.filterwarnings("ignore")

df = pd.read_csv('winequality-red.csv')

In [8]:
df.keys()

Index(['fixed acidity;"volatile acidity";"citric acid";"residual sugar";"chlorides";"free sulfur dioxide";"total sulfur dioxide";"density";"pH";"sulphates";"alcohol";"quality"'], dtype='object')

In [18]:
df = pd.read_csv('winequality-red.csv',sep = ';')

In [20]:
df.describe()

,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
count,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0
mean,8.319637,0.527821,0.270976,2.538806,0.087467,15.874922,46.467792,0.996747,3.311113,0.658149,10.422983,5.636023
std,1.741096,0.17906,0.194801,1.409928,0.047065,10.460157,32.895324,0.001887,0.154386,0.169507,1.065668,0.807569
min,4.6,0.12,0.0,0.9,0.012,1.0,6.0,0.99007,2.74,0.33,8.4,3.0
25%,7.1,0.39,0.09,1.9,0.07,7.0,22.0,0.9956,3.21,0.55,9.5,5.0
50%,7.9,0.52,0.26,2.2,0.079,14.0,38.0,0.99675,3.31,0.62,10.2,6.0
75%,9.2,0.64,0.42,2.6,0.09,21.0,62.0,0.997835,3.4,0.73,11.1,6.0
max,15.9,1.58,1.0,15.5,0.611,72.0,289.0,1.00369,4.01,2.0,14.9,8.0


In [21]:
df.head()

,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5


The features are chemical properties such as acidity, pH, and alcohol content. The target is 'quality', an integer score that ranges from 3 to 8 in this dataset. We see that the feature ranges vary widely: 'density' varies from about 0.99 to 1.00 and 'total sulfur dioxide' from 6 to 289!

Many machine learning models use some form of distance to inform their predictions, so features on far larger scales can unduly influence the model. For example, K-nearest neighbors uses distance explicitly when making predictions. For this reason, we want features to be on a similar scale. To achieve this, we do what is called normalizing, or centering and scaling.
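To make the effect concrete, here is a small sketch with three hypothetical wines described only by ('density', 'total sulfur dioxide'). The three points are made up for illustration; the standard deviations are the ones reported by `df.describe()` above. Dividing by the standard deviation is enough for this comparison, since centering does not change distances.

```python
import numpy as np

# Three hypothetical wines: (density, total sulfur dioxide). Values are illustrative.
a = np.array([0.990, 40.0])
b = np.array([1.003, 41.0])  # very different density, similar sulfur dioxide
c = np.array([0.990, 60.0])  # same density as a, different sulfur dioxide

def euclidean(u, v):
    """Plain Euclidean distance between two feature vectors."""
    return float(np.sqrt(np.sum((u - v) ** 2)))

# On the raw scale, sulfur dioxide dominates: b looks much closer to a than c does.
d_ab = euclidean(a, b)
d_ac = euclidean(a, c)

# Column standard deviations taken from the describe() output above.
stds = np.array([0.001887, 32.895324])

# After dividing by the standard deviation, the density gap counts too,
# and the ordering of the two distances flips.
d_ab_scaled = euclidean(a / stds, b / stds)
d_ac_scaled = euclidean(a / stds, c / stds)

print(d_ab, d_ac)
print(d_ab_scaled, d_ac_scaled)
```

On the raw scale the difference in density is practically invisible to the distance, even though it spans almost the whole observed range of that feature.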

## Ways to normalize your data
There are several ways to normalize your data. Given any column,
- You can subtract the mean and divide by the standard deviation so that all features are centred around zero and have variance one. This is called standardization.
- You can also subtract the minimum and divide by the range of the data so the normalized dataset has minimum zero and maximum one.
- You can also normalize so that the data ranges from -1 to 1 instead.

In this notebook, I'll show how to perform standardization. See the scikit-learn docs for how to implement the other approaches.
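As a rough sketch, the three approaches listed above can be written directly in NumPy for a single column. The values in `x` are illustrative, and the variable names are my own. Note that `x.std()` is the population standard deviation (NumPy's default `ddof=0`).

```python
import numpy as np

# One illustrative column of values (not taken from the dataset).
x = np.array([6.0, 22.0, 38.0, 62.0, 289.0])

# Standardization: subtract the mean, divide by the standard deviation.
standardized = (x - x.mean()) / x.std()

# Min-max scaling: subtract the minimum, divide by the range, giving values in [0, 1].
min_max = (x - x.min()) / (x.max() - x.min())

# Rescale the min-max result so the values lie in [-1, 1].
symmetric = 2 * min_max - 1

print(standardized.mean(), standardized.std())
print(min_max.min(), min_max.max())
print(symmetric.min(), symmetric.max())
```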

In [22]:
X = df.drop('quality', axis=1).values  # features: everything except the target
y = df['quality'].values  # target: the quality score

To scale our features, we import `scale` from `sklearn.preprocessing`. We then pass the feature data to `scale`, which returns our scaled data. Comparing the mean and standard deviation of the original and scaled data verifies this. Note that `np.mean(X)` and `np.std(X)` are computed here over all entries of the array rather than column by column.

In [23]:
from sklearn.preprocessing import scale
X_scaled = scale(X)

In [25]:
import numpy as np
np.mean(X), np.std(X)

(8.134219224515322, 16.726533979432848)

In [26]:
np.mean(X_scaled), np.std(X_scaled)

(2.546626531486538e-15, 1.0)
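The check above pools all columns together. A per-column check is a slightly stronger verification, since `scale` standardizes each column separately. The sketch below uses a small synthetic array (`X_demo`, a made-up stand-in with two columns on very different scales) so that it runs without the CSV file.

```python
import numpy as np
from sklearn.preprocessing import scale

# Synthetic stand-in for the feature matrix: two columns on very different scales.
rng = np.random.default_rng(0)
X_demo = np.column_stack([
    rng.normal(0.997, 0.002, size=200),  # density-like column
    rng.normal(46.0, 33.0, size=200),    # sulfur-dioxide-like column
])

X_demo_scaled = scale(X_demo)

# axis=0 computes one statistic per column.
col_means = X_demo_scaled.mean(axis=0)
col_stds = X_demo_scaled.std(axis=0)

print(col_means)  # each entry should be numerically close to 0
print(col_stds)   # each entry should be numerically close to 1
```

The same `axis=0` calls can be applied to `X` and `X_scaled` from the cells above.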

## Scaling in a pipeline
We can also put a scaler in a pipeline object! To do so, we import `StandardScaler` from `sklearn.preprocessing` and build a pipeline object as we did earlier; here we'll use a K-nearest neighbors algorithm.

In [29]:
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

We then split our wine quality dataset into training and test sets, fit the pipeline to our training set, and predict on our test set.

In [30]:
steps  = [('scaler', StandardScaler()),('knn',KNeighborsClassifier())]
pipeline = Pipeline(steps)

In [31]:
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2, random_state=21)

In [32]:
knn_scaled = pipeline.fit(X_train,y_train)

In [33]:
y_pred = pipeline.predict(X_test)

In [35]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)

0.615625

In [37]:
knn_unscaled = KNeighborsClassifier().fit(X_train,y_train)
knn_unscaled.score(X_test,y_test)

0.49375

Computing the accuracy yields 0.615625, whereas performing KNN without scaling resulted in an accuracy of 0.49375. Scaling did improve our model performance.
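One reason to put the scaler inside the pipeline is that `pipeline.fit` learns the scaling statistics from the training set only, and `pipeline.predict` reuses them on the test set. The sketch below spells out the equivalent manual steps on synthetic data (`X_demo` and `y_demo` are made up for illustration) and checks that both routes give the same predictions.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: three features on very different scales, binary label.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3)) * np.array([0.002, 1.0, 30.0])
y_demo = (X_demo[:, 0] > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=21
)

# Route 1: scaler and classifier chained in a pipeline.
pipe = Pipeline([('scaler', StandardScaler()), ('knn', KNeighborsClassifier())])
pipe.fit(X_tr, y_tr)
pipe_pred = pipe.predict(X_te)

# Route 2: the same steps by hand. The scaler is fit on the training set only,
# then applied unchanged to the test set.
scaler = StandardScaler().fit(X_tr)
knn = KNeighborsClassifier().fit(scaler.transform(X_tr), y_tr)
manual_pred = knn.predict(scaler.transform(X_te))

same = np.array_equal(pipe_pred, manual_pred)
print(same)
```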

## CV and scaling in a pipeline
Let's also take a look at how we can use cross-validation with a supervised learning pipeline. We first build our pipeline. We then specify our hyperparameter space by creating a dictionary: the keys are the pipeline step name followed by a double underscore, followed by the hyperparameter name; the corresponding value is a list or an array of the values to try for that particular hyperparameter. In this case, we are tuning only `n_neighbors` in the KNN model.

As always, we split our data into cross-validation and hold-out sets. We then perform a grid search over the parameters in the pipeline by instantiating the `GridSearchCV` object and fitting it to our training data. The `predict` method calls `predict` on the estimator with the best parameters found, and we do this on the hold-out set.

We also print the best parameters chosen by our grid search, along with the accuracy and classification report of the predictions on the hold-out set.
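The `step__parameter` naming convention can be inspected directly on a pipeline. As a small sketch, `get_params()` lists every name that can be used as a key in the parameter dictionary, and `set_params` accepts the same names. The value 7 below is arbitrary and chosen only for illustration.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([('scaler', StandardScaler()), ('knn', KNeighborsClassifier())])

# Every tunable parameter, addressed as <step name>__<parameter name>.
tunable = sorted(pipe.get_params().keys())
has_key = 'knn__n_neighbors' in tunable
print(has_key)

# The same name works for setting a value on the nested estimator.
pipe.set_params(knn__n_neighbors=7)
n = pipe.named_steps['knn'].n_neighbors
print(n)
```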

In [38]:
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

In [39]:
steps  = [('scaler', StandardScaler()),('knn',KNeighborsClassifier())]
pipeline = Pipeline(steps)

In [51]:
parameters = {'knn__n_neighbors': np.arange(1,50)}
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2, random_state=21)

In [52]:
from sklearn.model_selection import GridSearchCV
cv = GridSearchCV(pipeline, param_grid = parameters)

In [53]:
cv.fit(X_train,y_train)

GridSearchCV(estimator=Pipeline(steps=[('scaler', StandardScaler()),
                                       ('knn', KNeighborsClassifier())]),
             param_grid={'knn__n_neighbors': array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
       18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34,
       35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49])})

In [54]:
y_pred = cv.predict(X_test)

In [55]:
print(cv.best_params_)

{'knn__n_neighbors': 1}


In [56]:
print(cv.score(X_test,y_test))

0.634375


In [57]:
from sklearn.metrics import classification_report

# Compute per-class metrics as a dict and display them as a DataFrame
report_dict = classification_report(y_test, y_pred, output_dict=True)
pd.DataFrame(report_dict)

,3,4,5,6,7,8,accuracy,macro avg,weighted avg
precision,0.0,0.181818,0.657143,0.675214,0.630435,0.25,0.634375,0.399102,0.631398
recall,0.0,0.125,0.724409,0.603053,0.690476,0.333333,0.634375,0.412712,0.634375
f1-score,0.0,0.148148,0.689139,0.637097,0.659091,0.285714,0.634375,0.403198,0.630905
support,1.0,16.0,127.0,131.0,42.0,3.0,0.634375,320.0,320.0
