#### Centering and scaling data
I'll demonstrate how much a model's performance can improve when its features are scaled. Of course this is not always the case. For instance, scaling has minimal impact when all of the features of your model are binary. In this analysis I'll use the [white wine quality](https://archive.ics.uci.edu/ml/datasets/Wine+Quality) data from the University of California, Irvine Machine Learning Repository. In this data set each wine is graded on quality from 0 to 10, with 0 being bad and 10 being excellent.

In [192]:
import pandas as pd
import numpy as np

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

In [193]:
# Read the CSV file into a DataFrame
df = pd.read_csv('datasets/winequality-white.csv', sep=';')

df.head()

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.0 | 0.27 | 0.36 | 20.7 | 0.045 | 45.0 | 170.0 | 1.001 | 3.0 | 0.45 | 8.8 | 6 |
| 1 | 6.3 | 0.3 | 0.34 | 1.6 | 0.049 | 14.0 | 132.0 | 0.994 | 3.3 | 0.49 | 9.5 | 6 |
| 2 | 8.1 | 0.28 | 0.4 | 6.9 | 0.05 | 30.0 | 97.0 | 0.9951 | 3.26 | 0.44 | 10.1 | 6 |
| 3 | 7.2 | 0.23 | 0.32 | 8.5 | 0.058 | 47.0 | 186.0 | 0.9956 | 3.19 | 0.4 | 9.9 | 6 |
| 4 | 7.2 | 0.23 | 0.32 | 8.5 | 0.058 | 47.0 | 186.0 | 0.9956 | 3.19 | 0.4 | 9.9 | 6 |


In [194]:
# For the purposes of this analysis we'll make the target variable binary:
# True for lower-quality wines (quality <= 5), False otherwise.
df['quality'] = (df['quality'] <= 5)

# Now I can create arrays for the features and the target variable
y = df.quality.values
X = df.drop('quality', axis=1).values

In [195]:
# Import scale to standardize the features in 'X'
from sklearn.preprocessing import scale

# Scale the features
X_scaled = scale(X)

# Print the mean and standard deviation of the unscaled features
# (computed over every value in the array, not per feature)
print("Mean of Unscaled Features: {}".format(np.mean(X)))
print("Standard Deviation of Unscaled Features: {}".format(np.std(X)))

# Print the mean and standard deviation of the scaled features
# (again computed over every value in the array)
print("Mean of Scaled Features: {}".format(np.mean(X_scaled)))
print("Standard Deviation of Scaled Features: {}".format(np.std(X_scaled)))

Mean of Unscaled Features: 18.4326870725
Standard Deviation of Unscaled Features: 41.5449476409
Mean of Scaled Features: 2.73993761427e-15
Standard Deviation of Scaled Features: 1.0


##### Scaled Features
As you can see, scaling gives the features a mean of approximately 0 and a standard deviation of 1, which puts them all on a comparable scale. In the data set 'density', for instance, takes values between 0.98 and 1.04, while 'total sulfur dioxide' ranges from 9 to 440. Without scaling, a distance-based model such as k-NN is dominated by the features with the largest ranges, and this is why scaling comes in handy. Next, we'll see how scaling the features affects model performance, in this case positively.
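
To make the effect concrete, here is a minimal sketch of per-feature standardization using only NumPy. It does not use the wine data: `X_demo` is a hypothetical two-column array whose ranges merely imitate 'density' and 'total sulfur dioxide', and the points `a` and `b` are made up for illustration. The sketch assumes the usual definition of standardization (subtract each column's mean, divide by its population standard deviation), and it also shows how much of the squared Euclidean distance between two points comes from the wide-range feature before and after scaling.

```python
import numpy as np

# Hypothetical two-feature data on very different scales (NOT the wine data).
rng = np.random.default_rng(0)
X_demo = np.column_stack([
    rng.uniform(0.98, 1.04, size=200),   # narrow range, similar to 'density'
    rng.uniform(9.0, 440.0, size=200),   # wide range, similar to 'total sulfur dioxide'
])

# Standardize each column: subtract its mean, divide by its standard deviation.
col_mean = X_demo.mean(axis=0)
col_std = X_demo.std(axis=0)
X_demo_scaled = (X_demo - col_mean) / col_std

print("Per-feature std before scaling:", X_demo.std(axis=0))
print("Per-feature mean after scaling:", X_demo_scaled.mean(axis=0))
print("Per-feature std after scaling: ", X_demo_scaled.std(axis=0))


def wide_feature_share(p, q):
    """Fraction of the squared Euclidean distance contributed by the second feature."""
    sq = (p - q) ** 2
    return sq[1] / sq.sum()


# Two made-up points: far apart in the narrow feature, close in the wide one.
a = np.array([0.98, 100.0])
b = np.array([1.04, 110.0])

share_before = wide_feature_share(a, b)
share_after = wide_feature_share((a - col_mean) / col_std, (b - col_mean) / col_std)

print("Share of distance from wide feature before scaling:", share_before)
print("Share of distance from wide feature after scaling: ", share_after)
```

Before scaling, nearly all of the distance between `a` and `b` comes from the wide-range column even though the two points sit at opposite ends of the narrow one; after scaling the balance reverses.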

In [199]:
# First I'll split the unscaled data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Fit the k-NN classifier to the unscaled data
knn_unscaled = KNeighborsClassifier().fit(X_train, y_train)

# Store the model's score in a variable for later use
accuracy_no_scaling = knn_unscaled.score(X_test, y_test)

In [200]:
# Import the necessary pipeline and scaling modules
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# Setup the pipeline steps
steps = [('scaler', StandardScaler()),
        ('knn', KNeighborsClassifier())]
        
# Create the pipeline
pipeline = Pipeline(steps)

# Create training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Fit the pipeline to the training set
knn_scaled = pipeline.fit(X_train, y_train)

# Compute and print metrics
print('Accuracy with Scaling: {}'.format(knn_scaled.score(X_test, y_test)))
print('Accuracy without Scaling: {}'.format(accuracy_no_scaling))

Accuracy with Scaling: 0.770068027211
Accuracy without Scaling: 0.697959183673


##### Model Performance
The accuracy of this model increased by 7.2 percentage points just by scaling the data. As a final exercise I'll create another model pipeline that makes use of hyperparameter tuning, with an SVM ('SVC') as the classifier and 'StandardScaler' again to scale the data. The SVM has hyperparameters 'C' and 'gamma': 'C' controls the regularization strength (the regularization is inversely proportional to 'C'), and 'gamma' is the kernel coefficient.
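
As a rough illustration of what 'gamma' does, the sketch below evaluates the RBF kernel, `exp(-gamma * ||x - z||^2)`, which is the default kernel of scikit-learn's `SVC`. The function name `rbf_kernel_value` and the points `x` and `z` are hypothetical and chosen only for this example; the two gamma values are the ones searched in the grid.

```python
import numpy as np


def rbf_kernel_value(x, z, gamma):
    """RBF kernel between two points: exp(-gamma * squared Euclidean distance)."""
    diff = np.asarray(x, dtype=float) - np.asarray(z, dtype=float)
    return float(np.exp(-gamma * np.dot(diff, diff)))


# Two made-up points whose squared distance is 1^2 + 2^2 = 5.
x = [0.0, 0.0]
z = [1.0, 2.0]

for gamma in (0.01, 0.1):
    print("gamma = {}: kernel value = {}".format(gamma, rbf_kernel_value(x, z, gamma)))
```

A larger 'gamma' makes the kernel value fall off faster with distance, so each training point influences a smaller neighborhood. Because the kernel depends on distances between points, it is also sensitive to feature scale, which is one reason to keep the scaler in the pipeline.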

In [198]:
# Setup the pipeline
steps = [('scaler', StandardScaler()),
         ('SVM', SVC())]

# Create the pipeline
pipeline = Pipeline(steps)

# Specify the hyperparameter space
parameters = {'SVM__C':[1, 10, 100],
              'SVM__gamma':[0.1, 0.01]}

# Create train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=21)

# Instantiate the GridSearchCV object
cv = GridSearchCV(pipeline, parameters)

# Fit to the training set
cv.fit(X_train, y_train)

# Predict the labels of the test set
y_pred = cv.predict(X_test)

# Compute and print metrics
print("Accuracy: {}".format(cv.score(X_test, y_test)))
print(classification_report(y_test, y_pred))
print("Tuned Model Parameters: {}".format(cv.best_params_))

Accuracy: 0.779591836735
             precision    recall  f1-score   support

      False       0.83      0.85      0.84       662
       True       0.67      0.63      0.65       318

avg / total       0.78      0.78      0.78       980

Tuned Model Parameters: {'SVM__C': 10, 'SVM__gamma': 0.1}


##### Conclusion
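Great, the SVM model slightly outperforms the k-NN model, with 78.0% accuracy compared to 77.0%, and we get tuned hyperparameters of 10 for C and 0.1 for gamma. Keep in mind that the two models were evaluated on different train/test splits, so the comparison is only approximate.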

Precision is the fraction of predicted positive observations that are actually positive; it's given by `true positives / (true positives + false positives)`. Because 'True' marks wines with a quality of 5 or lower, the 'False' class corresponds to good quality wine and the 'True' class to bad quality wine. The SVM model is 83% precise when predicting good quality wine and 67% precise when predicting bad quality wine. Recall, the fraction of actual positive observations that were predicted correctly, is given by `true positives / (true positives + false negatives)`; the model recalls 85% of the good quality wines and 63% of the bad quality wines.
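
The two formulas can be checked by hand on a small example. The sketch below uses made-up labels (not the wine predictions) and a hypothetical helper, `precision_recall`, that counts true positives, false positives, and false negatives for whichever class is treated as positive.

```python
def precision_recall(y_true, y_pred, positive):
    """Return (precision, recall) for the class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up labels: three 'True' (bad) wines and five 'False' (good) wines.
y_true_demo = [True, True, True, False, False, False, False, False]
y_pred_demo = [True, True, False, True, False, False, False, False]

for label in (False, True):
    precision, recall = precision_recall(y_true_demo, y_pred_demo, positive=label)
    print("class {}: precision = {:.2f}, recall = {:.2f}".format(label, precision, recall))
```

Computing the metrics once per class, as done here, mirrors the per-row layout of `classification_report`, where each class takes a turn as the positive class.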