# Centering and scaling


In [1]:
import numpy as np
import pandas as pd

In [2]:
df = pd.read_csv('music_clean.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 13 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Unnamed: 0        1000 non-null   int64  
 1   popularity        1000 non-null   float64
 2   acousticness      1000 non-null   float64
 3   danceability      1000 non-null   float64
 4   duration_ms       1000 non-null   float64
 5   energy            1000 non-null   float64
 6   instrumentalness  1000 non-null   float64
 7   liveness          1000 non-null   float64
 8   loudness          1000 non-null   float64
 9   speechiness       1000 non-null   float64
 10  tempo             1000 non-null   float64
 11  valence           1000 non-null   float64
 12  genre             1000 non-null   int64  
dtypes: float64(11), int64(2)
memory usage: 101.7 KB


In [3]:
print(df[['duration_ms', 'loudness', 'speechiness']].describe())

        duration_ms     loudness  speechiness
count  1.000000e+03  1000.000000  1000.000000
mean   2.172204e+05    -8.253305     0.077879
std    1.175582e+05     5.158523     0.089451
min   -1.000000e+00   -38.718000     0.023400
25%    1.806562e+05    -9.775500     0.033100
50%    2.163000e+05    -6.855000     0.043600
75%    2.605025e+05    -4.977750     0.074950
max    1.617333e+06    -0.883000     0.710000


In the example above we see that the ranges in our dataset vary widely: `duration_ms` goes up to about 1.6 million, while `speechiness` stays below 1. Some models can be influenced by this disproportion in scales. Why? Because many models use some form of distance to generalize on data (KNN, for example).
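To make the distance argument concrete, here is a minimal sketch with three made-up songs (the values are illustrative assumptions, not rows from `music_clean.csv`). It shows how the feature with the largest range ends up deciding the Euclidean distance almost on its own.

```python
import numpy as np

# Hypothetical songs as [duration_ms, loudness, speechiness]
a = np.array([216_000.0, -6.9, 0.04])
b = np.array([217_000.0, -6.9, 0.70])  # very different speechiness, almost the same duration
c = np.array([260_000.0, -6.9, 0.04])  # identical speechiness, longer track

dist_ab = np.linalg.norm(a - b)
dist_ac = np.linalg.norm(a - c)

# Share of each feature in the squared distance between a and b
sq_diff = (a - b) ** 2
share = sq_diff / sq_diff.sum()

print(dist_ab, dist_ac)
print(share)
```

Even though `b` differs from `a` mostly in `speechiness`, nearly all of the distance comes from `duration_ms`, and `c` looks far away only because of its duration.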

In order to prevent bad generalization we want features to be on a similar scale. So let's see how to normalize or standardize the data. 

## Scaling

To **standardize** the data we subtract the mean and divide by the standard deviation. As a result, each feature is centered around zero and has a variance equal to 1.
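As a rough sketch, standardization can be written by hand as `z = (x - mean) / std` and compared with `StandardScaler`. The small array below is a made-up example; note that `StandardScaler` uses the population standard deviation, which is also NumPy's default (`ddof=0`).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy single-feature column (illustrative values only)
x = np.array([[2.0], [4.0], [4.0], [4.0], [5.0], [5.0], [7.0], [9.0]])

mean = x.mean(axis=0)  # per-column mean
std = x.std(axis=0)    # per-column population std (ddof=0)

z_manual = (x - mean) / std
z_sklearn = StandardScaler().fit_transform(x)

print(z_manual.ravel())
print(np.allclose(z_manual, z_sklearn))
```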

We can also use **normalization** to scale the data to a specific range, usually between -1 and 1 or between 0 and 1.
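A minimal sketch of min-max normalization, assuming scikit-learn's `MinMaxScaler` as the tool (the notebook itself does not use it). The manual formula is `(x - min) / (max - min)`, and `feature_range` switches the target interval.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy single-feature column (illustrative values only)
x = np.array([[1.0], [2.0], [3.0], [5.0]])

# Manual scaling to the range [0, 1]
manual_01 = (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))

# The same with scikit-learn, plus a variant scaled to [-1, 1]
scaled_01 = MinMaxScaler().fit_transform(x)
scaled_pm1 = MinMaxScaler(feature_range=(-1, 1)).fit_transform(x)

print(manual_01.ravel())
print(scaled_pm1.ravel())
```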

**Beware of the traps!**

Normalization is sensitive to outliers. It uses the maximum and minimum values to perform scaling, so if outliers are present in the dataset they can distort the scaling.
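The effect of a single outlier can be sketched with a few made-up durations in seconds (illustrative values, not taken from the dataset). `min_max` is a hypothetical helper defined only for this example.

```python
import numpy as np

def min_max(x):
    """Scale a 1-D array to the range [0, 1]."""
    return (x - x.min()) / (x.max() - x.min())

durations = np.array([180.0, 200.0, 216.0, 260.0])
with_outlier = np.append(durations, 1617.0)  # one extremely long track

scaled_clean = min_max(durations)
scaled_outlier = min_max(with_outlier)

print(scaled_clean)
print(scaled_outlier)
```

Without the outlier the four typical values spread over the whole interval; with it they are squeezed into a narrow band near zero.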

*Other things to keep in mind when choosing a scaling method:*
* Impact on the data distribution
* Interpretation of the transformed values
* Algorithm sensitivity

In [6]:
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
# Note: 'Unnamed: 0' (probably a saved index column) stays among the features here
X = df.drop('genre', axis=1).values
y = df['genre'].values

# Splitting to avoid data leakage
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=.2, random_state=42)

sc = StandardScaler()
X_train_scaled = sc.fit_transform(X_train)  # fit on the training set and transform it
X_test_scaled = sc.transform(X_test)  # transform the test set with the training statistics

# Let's look at the mean and std before and after scaling

print(np.mean(X), np.std(X))
print(np.mean(X_train_scaled), np.std(X_train_scaled))

20666.582585618085 68890.98734103922
3.5971225997855074e-16 0.9999999999999996
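The numbers above are computed over all columns at once. A per-column check can be sketched as follows, using synthetic data as a stand-in for the music features (the distributions below are assumptions chosen only to mimic very different scales). It also shows that the scaler's statistics come from the training split alone, so the test split is not exactly zero-mean.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Three synthetic features on very different scales
X_demo = np.column_stack([
    rng.normal(200_000, 100_000, size=500),
    rng.normal(-8, 5, size=500),
    rng.uniform(0, 1, size=500),
])

X_tr, X_te = train_test_split(X_demo, test_size=0.2, random_state=42)

scaler = StandardScaler().fit(X_tr)  # statistics learned from the training split only
X_tr_s = scaler.transform(X_tr)
X_te_s = scaler.transform(X_te)

# axis=0 gives one value per feature instead of a single global number
print(X_tr_s.mean(axis=0), X_tr_s.std(axis=0))
print(X_te_s.mean(axis=0), X_te_s.std(axis=0))
```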


## Scaling in pipeline

In [7]:
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier

steps = [('scaler', StandardScaler()),
         ('knn', KNeighborsClassifier(n_neighbors=6))]

pip = Pipeline(steps)

knn_sc = pip.fit(X_train, y_train)
y_pred = knn_sc.predict(X_test)
knn_sc.score(X_test, y_test)

0.89

In [8]:
# Let's compare the scores with unscaled data

knn = KNeighborsClassifier(n_neighbors=6)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
knn.score(X_test, y_test)

0.925

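This time scaling did not help: the scaled pipeline scores 0.89 on the test set, while the unscaled model scores 0.925. Scaling is not guaranteed to improve accuracy, so it is worth comparing both.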

## CV and scaling in pipeline

In [10]:
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

steps = [('scaler', StandardScaler()),
         ('knn', KNeighborsClassifier(n_neighbors=6))]

pip = Pipeline(steps)
params = {'knn__n_neighbors': np.arange(1,50)}

cv = GridSearchCV(pip, param_grid=params)
cv.fit(X_train, y_train)
y_pred = cv.predict(X_test)
cv.score(X_test, y_test)
cv.best_score_, cv.best_params_

(0.9287500000000002, {'knn__n_neighbors': 2})
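In the cell above only the last expression is displayed, so the tuple shows `best_score_` (the mean cross-validated score on the training data) and not the test score. A minimal sketch that keeps the two apart, using synthetic data from `make_classification` as a stand-in for the music dataset and a smaller grid for speed (both are assumptions for illustration):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

pipe = Pipeline([('scaler', StandardScaler()),
                 ('knn', KNeighborsClassifier())])
search = GridSearchCV(pipe, param_grid={'knn__n_neighbors': np.arange(1, 11)}, cv=5)
search.fit(X_tr, y_tr)

cv_score = search.best_score_            # mean CV accuracy of the best setting
test_score = search.score(X_te, y_te)    # accuracy on the held-out test set
best_k = search.best_params_['knn__n_neighbors']

# The scaler inside the refitted best pipeline was fitted on the training data only
fitted_scaler = search.best_estimator_.named_steps['scaler']

print(best_k, cv_score, test_score)
print(fitted_scaler.mean_)
```

Because the scaler is a pipeline step, it is refitted on the training part of each fold during the search, so the validation folds do not leak into the scaling statistics.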