# Standardizing Data
  
This chapter is all about standardizing data. Often a model will make some assumptions about the distribution or scale of your features. Standardization is a way to make your data fit these assumptions and improve the algorithm's performance.

In [39]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## Standardization
  
It's possible that you'll come across datasets with lots of numerical noise, perhaps due to feature variance or differently-scaled data. The preprocessing solution for that is standardization.
  
**What is standardization?**
  
Standardization is a preprocessing method used to transform continuous data to make it look normally distributed. In scikit-learn, this is often a necessary step, because many models make underlying assumptions that the training data is normally distributed, and if it isn't, we risk biasing the model. Data can be standardized in many different ways, but in this course, we're going to talk about two methods: log normalization and scaling.
  
It's also important to note that standardization is a preprocessing method applied to continuous, numerical data. We'll cover methods for dealing with categorical data later in the course.
  
**When to standardize: linear distances**
  
There are a few different scenarios in which we'd want to standardize our data. The first is when we're working with a model that uses a linear distance metric or operates in a linear space, such as k-nearest neighbors, linear regression, or k-means clustering. Such a model assumes that the data and features we give it are related in a linear fashion, or can be measured with a linear distance metric, which may not always be the case.
  
**When to standardize: high variance**
  
Standardization should also be used when dataset features have a high variance, which is also related to distance metrics. This could bias a model that assumes the data is normally distributed. If a feature in our dataset has a variance that's an order of magnitude or more greater than the other features, this could impact the model's ability to learn from other features in the dataset.
  
**When to standardize: different scales**
  
Modeling a dataset that contains continuous features that are on different scales is another standardization scenario. For example, consider predicting house prices using two features: the number of bedrooms and the last sale price. These two features are on vastly different scales, which will confuse most models. To compare these features, we must standardize them to put them in the same linear space. All of these scenarios assume we're working with a model that makes some kind of linearity assumptions; however, there are a number of models that are perfectly fine operating in a nonlinear space, or do a certain amount of standardization upon input, but they're outside the scope of this course.
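
To make the scale problem concrete, here is a minimal sketch using the house price example. The three houses and their values are hypothetical, invented only for illustration. It compares Euclidean distances before and after scaling with scikit-learn's `StandardScaler`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical houses: [number of bedrooms, last sale price in dollars]
houses = np.array([
    [2, 300_000],
    [5, 301_000],
    [2, 310_000],
], dtype=float)


def euclidean(a, b):
    """Straight-line distance between two feature vectors."""
    return np.sqrt(np.sum((a - b) ** 2))


# Unscaled: the price column decides the distance almost entirely.
d_raw_01 = euclidean(houses[0], houses[1])  # 3 bedrooms and $1,000 apart
d_raw_02 = euclidean(houses[0], houses[2])  # same bedrooms, $10,000 apart

# Scaled: both features contribute on comparable terms.
scaled = StandardScaler().fit_transform(houses)
d_scaled_01 = euclidean(scaled[0], scaled[1])
d_scaled_02 = euclidean(scaled[0], scaled[2])

print(f"raw:    d(0,1)={d_raw_01:.1f}  d(0,2)={d_raw_02:.1f}")
print(f"scaled: d(0,1)={d_scaled_01:.2f}  d(0,2)={d_scaled_02:.2f}")
```

On the raw values, the three-bedroom difference barely registers next to the price difference. After scaling, the two distances are of similar size, so a distance-based model can take both features into account.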

### When to standardize
  
**Now that you've learned when it is appropriate to standardize your data, which of these scenarios is NOT a reason to standardize?**
  
Possible Answers  
  
- [ ] A column you want to use for modeling has extremely high variance.

- [ ] You have a dataset with several continuous columns on different scales, and you'd like to use a linear model to train the data.

- [ ] The models you're working with use some sort of distance metric in a linear space.

- [x] Your dataset is comprised of categorical data.
  
Correct! Standardization is a preprocessing task performed on numerical, continuous data.

### Modeling without normalizing
  
Let's take a look at what might happen to your model's accuracy if you try to model data without doing some sort of standardization first.
  
Here we have a subset of the wine dataset. One of the columns, Proline, has an extremely high variance compared to the other columns. This is an example of where a technique like log normalization would come in handy, which you'll learn about in the next section.
  
The scikit-learn model training process should be familiar to you at this point, so we won't go too in-depth with it. You already have a k-nearest neighbors model available (knn) as well as the X and y sets you need to fit and score on.
  
1. Split up the X and y sets into training and test sets, ensuring that class labels are equally distributed in both sets.
  
2. Fit the knn model to the training features and labels.
  
3. Print the test set accuracy of the knn model using the `.score()` method.


In [40]:
wine = pd.read_csv('../_datasets/wine_types.csv')
wine.head()

Unnamed: 0,Type,Alcohol,Malic acid,Ash,Alcalinity of ash,Magnesium,Total phenols,Flavanoids,Nonflavanoid phenols,Proanthocyanins,Color intensity,Hue,OD280/OD315 of diluted wines,Proline
0,1,14.23,1.71,2.43,15.6,127,2.8,3.06,0.28,2.29,5.64,1.04,3.92,1065
1,1,13.2,1.78,2.14,11.2,100,2.65,2.76,0.26,1.28,4.38,1.05,3.4,1050
2,1,13.16,2.36,2.67,18.6,101,2.8,3.24,0.3,2.81,5.68,1.03,3.17,1185
3,1,14.37,1.95,2.5,16.8,113,3.85,3.49,0.24,2.18,7.8,0.86,3.45,1480
4,1,13.24,2.59,2.87,21.0,118,2.8,2.69,0.39,1.82,4.32,1.04,2.93,735


In [41]:
# X/y split
X, y = wine[['Proline', 'Total phenols', 'Hue', 'Nonflavanoid phenols']], wine['Type'] 

In [42]:
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier


# Instantiate the KNN model
knn = KNeighborsClassifier()

# Seeding
SEED = 42

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=SEED)

# Fit the knn model to the training data
knn.fit(X_train, y_train)

# Score the model on the test data
print(knn.score(X_test, y_test))

0.6888888888888889


You can see that the accuracy score is pretty low (about 69%).  
Let's explore methods to improve this score.

## Log normalization
  
The first method we'll cover for standardization is log normalization.
  
**What is log normalization?**
  
Log normalization is a method for standardizing data that can be useful when we have features with high variance. It applies a logarithmic transformation to our values, which moves them onto a scale that approximates normality, an assumption that many models make. The method of log normalization we're going to work with takes the natural log of each number: the exponent to which you would raise the mathematical constant e (approximately 2.718) to get that number.
  
For example, the natural log of 30 is approximately 3.4, because e raised to the power of 3.4 is approximately 30. Log normalization is a good strategy when you care about relative changes in a linear model, but still want to capture the magnitude of change, and when we want to keep everything in the positive space. It's a nice way to reduce the variance of a column and make it comparable to other columns for modeling.
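
The arithmetic can be checked with NumPy. This is a small sketch; the `values` array is an illustrative assumption, not data from the course.

```python
import numpy as np

x = 30
log_x = np.log(x)        # natural log (base e)
print(round(log_x, 1))   # 3.4
print(np.exp(log_x))     # raising e to that power recovers approximately 30

# The log compresses large values far more than small ones, which is
# why it reduces the variance of a column with a wide range.
values = np.array([1, 10, 100, 1000])
print(np.log(values))
```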
  
**Log normalization in Python**
  
Applying log normalization to data in Python is fairly straightforward. We can use the `log()` function from NumPy to do the transformation. 
  
Consider a DataFrame with two numeric columns. If we check the variance of the columns and find that column 2 has a significantly higher variance than column 1, column 2 is a clear candidate for log normalization. To apply it, we pass the column we want to log normalize directly into `np.log()`. Comparing column 2 with the log-normalized column 2 shows that the transformation has scaled down the values. If we then check the variance of both column 1 and the log-normalized column 2, the variances are much closer together.
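
As a rough sketch of those steps, the snippet below builds a small DataFrame and log normalizes one column. The column names (`col1`, `col2`, `col2_log`) and all values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical values; col2 spans a much wider range than col1.
df = pd.DataFrame({
    "col1": [1.00, 1.20, 0.75, 1.60],
    "col2": [48.0, 100.0, 250.0, 1200.0],
})

# col2 has a far higher variance than col1.
print(df.var())

# Pass the column straight into np.log() to log normalize it.
df["col2_log"] = np.log(df["col2"])

# Compare the original and transformed values, then the variances.
print(df[["col2", "col2_log"]])
print(df[["col1", "col2_log"]].var())
```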

### Checking the variance
  
Check the variance of the columns in the wine dataset. Out of the columns listed below, which is the most appropriate candidate for normalization?  
  
in: `wine.var()`  
out:  
<table>
  <tr>
    <th>Type</th>
    <td>0.601</td>
  </tr>
  <tr>
    <th>Alcohol</th>
    <td>0.659</td>
  </tr>
  <tr>
    <th>Malic acid</th>
    <td>1.248</td>
  </tr>
  <tr>
    <th>Ash</th>
    <td>0.075</td>
  </tr>
  <tr>
    <th>Alcalinity of ash</th>
    <td>11.153</td>
  </tr>
  <tr>
    <th>Magnesium</th>
    <td>203.989</td>
  </tr>
  <tr>
    <th>Total phenols</th>
    <td>0.392</td>
  </tr>
  <tr>
    <th>Flavanoids</th>
    <td>0.998</td>
  </tr>
  <tr>
    <th>Nonflavanoid phenols</th>
    <td>0.015</td>
  </tr>
  <tr>
    <th>Proanthocyanins</th>
    <td>0.328</td>
  </tr>
  <tr>
    <th>Color intensity</th>
    <td>5.374</td>
  </tr>
  <tr>
    <th>Hue</th>
    <td>0.052</td>
  </tr>
  <tr>
    <th>OD280/OD315 of diluted wines</th>
    <td>0.504</td>
  </tr>
  <tr>
    <th>Proline</th>
    <td>99166.717</td>
  </tr>
</table>
  
Correct! The Proline column has an extremely high variance.



### Log normalization in Python
  
Now that we know that the Proline column in our wine dataset has a large amount of variance, let's log normalize it.
  
1. Print out the variance of the Proline column for reference.
  
2. Use the `np.log()` function on the Proline column to create a new, log-normalized column named Proline_log.
  
3. Print out the variance of the Proline_log column to see the difference.

In [43]:
# Print out the variance of the Proline column
print(wine.Proline.var())

# Apply the log normalization function to the Proline column
wine['Proline_log'] = np.log(wine['Proline'])

# Check the variance of the normalized Proline column
print(wine.Proline_log.var())

99166.71735542436
0.17231366191842012


The `np.log()` function is an easy way to log normalize a column.

## Scaling data for feature comparison
  
**What is feature scaling?**
  
Scaling is a method of standardization that's most useful when we're working with a dataset that contains continuous features that are on different scales, and we're using a model that operates in some sort of linear space (like linear regression or k-nearest neighbors). Feature scaling transforms the features in your dataset so they have a mean of zero and a variance of one. This will make it easier to linearly compare features, which is a requirement for many models in scikit-learn.
  
**How to scale data**
  
Let's take a look at another DataFrame. The values are on a consistent scale within each column, but not across columns. If we look at the variance, it's relatively low in every column, so log normalization isn't needed. Scaling is the better choice for modeling this data.
  
Scikit-learn has a variety of scaling methods, but we'll focus on `StandardScaler()`, which is imported from `sklearn.preprocessing`. This scaler works by subtracting each feature's mean and dividing by its standard deviation, so that each feature ends up with a variance of one. Once we instantiate a `StandardScaler()`, we can apply the `.fit_transform()` method on the DataFrame. We can convert the output of `.fit_transform()`, which is a NumPy array, to a DataFrame to look at it more easily. If we take a look at the newly scaled DataFrame, we can see that the values have been scaled down, and if we calculate the variance by column, it's not only close to 1, but it's now the same for all of our features.
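
A minimal sketch of this process follows, using a hypothetical three-column DataFrame (the names and values are made up for illustration).

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical data: consistent scale within each column, not across columns.
df = pd.DataFrame({
    "col1": [1.00, 1.20, 0.75, 1.60],
    "col2": [48.0, 45.5, 46.2, 50.0],
    "col3": [100.0, 101.3, 103.5, 104.0],
})

scaler = StandardScaler()

# fit_transform() returns a NumPy array, so wrap it back into a DataFrame.
df_scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)

print(df_scaled)
print(df_scaled.mean())  # approximately 0 for every column
print(df_scaled.var())   # the same value for every column
```

The variance printed by pandas is close to 1 rather than exactly 1 because `DataFrame.var()` uses the sample variance (`ddof=1`) by default, whereas `StandardScaler` divides by the population standard deviation (`ddof=0`). With only four rows the gap is noticeable, and it shrinks as the number of rows grows.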

### Scaling data - investigating columns
  
You want to use the Ash, Alcalinity of ash, and Magnesium columns in the wine dataset to train a linear model, but it's possible that these columns are all measured in different ways, which would bias a linear model.

In [44]:
print(wine[['Ash', 'Alcalinity of ash', 'Magnesium']].describe())  # Summary Statistics

              Ash  Alcalinity of ash   Magnesium
count  178.000000         178.000000  178.000000
mean     2.366517          19.494944   99.741573
std      0.274344           3.339564   14.282484
min      1.360000          10.600000   70.000000
25%      2.210000          17.200000   88.000000
50%      2.360000          19.500000   98.000000
75%      2.557500          21.500000  107.000000
max      3.230000          30.000000  162.000000


In [45]:
print(wine[['Ash', 'Alcalinity of ash', 'Magnesium']].var())  # Variance

Ash                    0.075265
Alcalinity of ash     11.152686
Magnesium            203.989335
dtype: float64


Understanding your data is a crucial first step before deciding on the most appropriate standardization technique.

### Scaling data - standardizing columns
  
Since we know that the Ash, Alcalinity of ash, and Magnesium columns in the wine dataset are all on different scales, let's standardize them in a way that allows for use in a linear model.
  
1. Import the StandardScaler class.

2. Instantiate a `StandardScaler()` and store it in the variable, scaler.
  
3. Create a subset of the wine DataFrame containing the Ash, Alcalinity of ash, and Magnesium columns, assign it to wine_subset.
  
4. Fit and transform the standard scaler to wine_subset.

In [46]:
from sklearn.preprocessing import StandardScaler


# Creating the scaler
scaler = StandardScaler()

# Take a subset of the DataFrame you want to scale
wine_subset = wine[['Ash', 'Alcalinity of ash', 'Magnesium']]

print(wine_subset.iloc[:3], '\n')

# Apply the scaler to the DataFrame subset
wine_subset_scaled = scaler.fit_transform(wine_subset)

print(wine_subset_scaled[:3])

    Ash  Alcalinity of ash  Magnesium
0  2.43               15.6        127
1  2.14               11.2        100
2  2.67               18.6        101 

[[ 0.23205254 -1.16959318  1.91390522]
 [-0.82799632 -2.49084714  0.01814502]
 [ 1.10933436 -0.2687382   0.08835836]]


In scikit-learn, running `.fit_transform()` during preprocessing both fits the scaler to the data and transforms the data in a single step.
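
The equivalence between the one-step and two-step forms can be sketched as below. The array holds only the three rows of Ash, Alcalinity of ash, and Magnesium printed above, so the scaled values will differ from those of the full dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# First three rows of the wine subset shown above.
X = np.array([
    [2.43, 15.6, 127.0],
    [2.14, 11.2, 100.0],
    [2.67, 18.6, 101.0],
])

# One step: learn the statistics and apply them.
one_step = StandardScaler().fit_transform(X)

# Two steps: fit() learns the statistics, transform() applies them.
scaler = StandardScaler()
scaler.fit(X)
two_step = scaler.transform(X)

print(np.allclose(one_step, two_step))  # True
print(scaler.mean_)   # per-column means learned during fit
print(scaler.scale_)  # per-column standard deviations learned during fit
```

Keeping the two steps separate matters once there is a test set: the fitted scaler can then be reused with `.transform()` on data it was not fitted on.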

## Standardized data and modeling
  
Now that we've learned a couple of different methods for standardization, it's time to see how this fits into the modeling workflow. As mentioned before, many models in scikit-learn require our data to be scaled appropriately across columns; otherwise, we risk biasing the results.
  
**K-nearest neighbors**
  
You should already be a little familiar with both k-nearest neighbors and the scikit-learn workflow from previous courses, but we'll do a quick review of both. K-nearest neighbors is a model that classifies data based on its distance to training set data. A new data point is assigned a label based on the class that the majority of surrounding data points belong to.
  
**General workflow for ML modeling**

The workflow for training a model in scikit-learn starts with splitting the data into a training and test set. This can be done with scikit-learn's `train_test_split()` function. Splitting the data will allow us to evaluate the model's performance using unseen data, rather than evaluating its performance on the data it was trained on. 
  
Once the data has been split, we can begin preprocessing the training data. It's really important to split the data prior to preprocessing, so none of the test data is used to train the model. When non-training data is used to train the model, this is called data leakage, and it should be avoided so that any performance metrics are reflective of the model's ability to generalize to unseen data.
  
We instantiate a k-neighbors classifier and a standard scaler to scale our features. Here, we fit the scaler to the training features and transform them using the `.fit_transform()` method, then preprocess the test features using the `.transform()` method. Using only `.transform()` on the test features means they are never used to fit the scaler, which avoids data leakage.
  
Now that we've finished preprocessing, we can fit the KNN model to the scaled training features, and return the test set accuracy using the `.score()` method on the scaled test features and test labels.
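
The workflow described above can be sketched end to end. The data here is synthetic and generated only for illustration (it is not the wine dataset), and the check on `scaler.mean_` is included to show that the scaler's statistics come from the training split alone.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption, not the wine dataset):
# two classes, with one feature on a much larger scale than the other.
rng = np.random.default_rng(42)
n = 100
y = np.repeat([0, 1], n // 2)
X = np.column_stack([
    rng.normal(loc=y * 2.0, scale=1.0),          # small scale, informative
    rng.normal(loc=500.0, scale=200.0, size=n),  # large scale, pure noise
])

# 1. Split first, so the test set stays unseen during preprocessing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

# 2. Fit the scaler on the training features only, then reuse it.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# The learned means match the training split, not the full dataset.
print(np.allclose(scaler.mean_, X_train.mean(axis=0)))  # True

# 3. Fit and score the model on the scaled features.
knn = KNeighborsClassifier()
knn.fit(X_train_scaled, y_train)
print(knn.score(X_test_scaled, y_test))
```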

### KNN on non-scaled data
  
Before adding standardization to your scikit-learn workflow, you'll first take a look at the accuracy of a K-nearest neighbors model on the wine dataset without standardizing the data.
  
1. Split the dataset into training and test sets.
  
2. Fit the knn model to the training data.
  
3. Print out the test set accuracy of your trained knn model.

In [47]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split


# Load data
wine = pd.read_csv('../_datasets/wine_types.csv')

# X/y split
X, y = wine.drop('Type', axis=1), wine['Type'] 

# Seeding
SEED = 42

# Instantiate KNN
knn = KNeighborsClassifier()

# Split the dataset and labels into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=SEED)

# Fit the k-nearest neighbors model to the training data
knn.fit(X_train, y_train)

# Score the model on the test data
print(knn.score(X_test, y_test))

0.7777777777777778


This accuracy definitely isn't poor, but let's see if we can improve it by standardizing the data.

### KNN on scaled data
  
The accuracy score on the unscaled wine dataset was decent (77.78%), but let's see what you can achieve by using standardization.
  
1. Instantiate a `StandardScaler()` and store it in a variable named scaler.
  
2. Scale the training and test features, being careful not to introduce *data leakage*.
  
3. Fit the knn model to the scaled training data.
  
4. Evaluate the model's performance by computing the test set accuracy.

Use `.fit_transform()` when scaling the training features.  
Use `.transform()` when scaling the test features.  
Use `.fit()` when fitting the knn model to the scaled training features.

In [48]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


# Load data
wine = pd.read_csv('../_datasets/wine_types.csv')

# Seeding
SEED = 42

# X/y split
X, y = wine.drop('Type', axis=1), wine['Type']

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=SEED)

# Instantiate KNN
knn = KNeighborsClassifier()

# Instantiate a StandardScaler
scaler = StandardScaler()

# Scale the training and test features
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Fit the k-nearest neighbors model to the training data
knn.fit(X_train_scaled, y_train)

# Score the model on the test data
print(knn.score(X_test_scaled, y_test))

0.9333333333333333


That's quite the improvement, and definitely made scaling the data worthwhile.