# Data Centering and Scaling
## An introduction to data centering, min-max normalization, and standardization

In this article, you will learn how to center and scale your data using common techniques. You will learn:

* How to center data and interpret centered data
* Why scaling your data is important
* How to scale data using two common methods:
    * Min-max Normalization
    * Standardization  
* When to choose normalization vs. standardization

Let’s get started!

### Data Centering

Data centering involves subtracting the mean of a data set from each data point so that the new mean is 0. Mathematically, this looks like:


$$
X_{centered,i} = X_i - \mu
$$



where `X_i` is a data point and the Greek letter `μ` is the mean of all the `X` values.

For example, let’s take a look at a data set of ages for five individuals:
```python
ages = [24, 40, 28, 22, 56]
```


The mean age in this data set is **`34 years old`**.

To center our data, we subtract the mean from each data point in `ages`:

```python
centered_ages = [-10, 6, -6, -12, 22]
```

This centered data is useful because it tells us how far above or below the mean each data point is, giving us additional insight that we can’t get just by looking at the initial data set. For example, the age of the first individual is ten years below the average.

Note that, because the **sum of the centered values is 0 (-10 + 6 - 6 - 12 + 22 = 0), the mean of the centered data is 0.**
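The calculation above can be reproduced in a few lines of plain Python. This is a minimal sketch that reuses the `ages` list from the example; the variable name `mean_age` is introduced here for illustration only.

```python
ages = [24, 40, 28, 22, 56]

# Mean of the data set: the sum of the values divided by their count
mean_age = sum(ages) / len(ages)

# Subtract the mean from every data point
centered_ages = [age - mean_age for age in ages]

print(mean_age)                                  # 34.0
print(centered_ages)                             # [-10.0, 6.0, -6.0, -12.0, 22.0]
print(sum(centered_ages) / len(centered_ages))   # 0.0
```

The values come out as floats because the mean is computed with division, but they match the centered ages listed above.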


### Data Scaling

A common task for data analysts and scientists is to find trends in data by comparing features of data points. However, this task is made difficult when the features are on drastically different scales.

For instance, let’s consider a data set containing two features, **`age`** and **`income`**.

A person’s age typically ranges from 0 to about 100 years. A person’s income, on the other hand, can range from 0 to many thousands of dollars. Clearly, age and income are two features that have vastly different ranges.

<img src=https://static-assets.codecademy.com/Paths/data-analyst-career-path/center-scaling/scaling.png width=1000>

This presents issues when trying to use many machine learning algorithms, which treat all dimensions equally regardless of their scale. The difference in one year of age is interpreted as exactly equal to the difference in one dollar of income. That makes no sense!
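To make this concrete, here is a small sketch using Euclidean distance, which distance-based methods such as k-nearest neighbors commonly rely on. The three people and their values are hypothetical, chosen only to illustrate the effect.

```python
import math

# Hypothetical people, each described as (age in years, income in dollars)
person_a = (25, 50000)
person_b = (65, 50500)   # 40 years older than A, income differs by $500
person_c = (26, 52000)   # 1 year older than A, income differs by $2,000

# Euclidean distance treats a one-unit difference the same in every dimension
dist_ab = math.dist(person_a, person_b)
dist_ac = math.dist(person_a, person_c)

print(round(dist_ab, 1))  # 501.6
print(round(dist_ac, 1))  # 2000.0
```

By this measure, A is "closer" to B than to C, even though A and C are almost the same age. The 40-year age gap barely registers next to the income difference.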

In other words, the income feature outweighs the importance of age because income is on a relatively huge scale. Take a look at the following image:

<img src=https://static-assets.codecademy.com/Paths/data-analyst-career-path/center-scaling/unnormalized.png width=600>


From this chart, it is impossible to notice any relationship between income and age – all of the data is squished to the left. This is because the feature with the larger scale (income) dominates the smaller feature (age).

We would like every feature to have the same scale so that each one contributes equally to the relationship. Data scaling lets us achieve this.

Two of the most commonly used data scaling techniques are:

* Min-max normalization
* Standardization

***
### Min-Max Normalization

Min-max normalization is one of the simplest and most common ways to scale data.

For every feature in a data set, the minimum value of that feature is transformed into 0, the maximum value is transformed into 1, and every other value is transformed into a decimal between 0 and 1.

For example, if the minimum value of a feature is 10, and the maximum value is 30, then 20 would be transformed to 0.5 since it is halfway between 10 and 30.

The formula for min-max normalization is as follows:

$$
X_{norm} = \frac{X - X_{min}}{X_{max} - X_{min}}
$$
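
As a quick check, plugging in a value of 20 for a feature whose minimum is 10 and maximum is 30 gives:

$$
X_{norm} = \frac{20 - 10}{30 - 10} = \frac{10}{20} = 0.5
$$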

Using min-max normalization, our previous chart of income vs. age looks like the following:

<img src=https://static-assets.codecademy.com/Paths/data-analyst-career-path/center-scaling/normalized.png width=600>

Notice that the data is now properly scaled – both age and income fall in the range of [0,1].

Let’s get some practice implementing min-max normalization in Python:

```python
def min_max_normalize(lst):
    """Scale the values in lst to the range [0, 1]."""
    # Assumes lst contains at least two distinct values;
    # otherwise maximum - minimum is 0 and the division fails.
    minimum = min(lst)
    maximum = max(lst)
    normalized = []

    for i in lst:
        # Shift by the minimum, then divide by the range (max - min)
        normalized.append((i - minimum) / (maximum - minimum))
    return normalized

# Test the function:
print(min_max_normalize([0, 25, 50, 75, 100]))
# prints [0.0, 0.25, 0.5, 0.75, 1.0]
print(min_max_normalize([10, 12, 14]))
# prints [0.0, 0.5, 1.0]
```


One downside of min-max normalization is that it does not handle outliers very well. For example, if you have 99 values between 0 and 20, and one value is 100, then the 99 values will all be transformed to a value between 0 and 0.2 while the outlier is transformed to 1. This results in skewed data:


<img src=https://static-assets.codecademy.com/Paths/data-analyst-career-path/center-scaling/outlier.png width=600>


In this example, normalizing the data fixed the squishing problem on the x-axis, but now the data is squished along the y-axis instead.
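
The effect of a single outlier can be checked numerically. The sketch below builds a made-up list of 99 evenly spaced values between 0 and 20, adds one outlier of 100, and applies the min-max formula. The data and the helper function are illustrative assumptions, not part of any real data set.

```python
def min_max_normalize(values):
    # Same min-max formula as above: (x - min) / (max - min)
    minimum, maximum = min(values), max(values)
    return [(v - minimum) / (maximum - minimum) for v in values]

# 99 evenly spaced values from 0 to 20, plus a single outlier at 100
typical = [20 * i / 98 for i in range(99)]
values = typical + [100]

normalized = min_max_normalize(values)

print(max(normalized[:-1]))  # largest non-outlier value: 0.2
print(normalized[-1])        # the outlier: 1.0
```

All 99 typical values end up in the bottom fifth of the [0, 1] range, while the outlier alone occupies the top.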

As we’ll see shortly, standardization is often a better choice when your data contains outliers.

### Standardization

Standardization (also known as Z-score normalization) is another common data scaling technique.

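Standardization involves subtracting the mean of a feature from each observation and then dividing by the feature’s standard deviation:

$$
z = \frac{X - \mu}{\sigma}
$$

where `X` is an observation, `μ` is the mean of the feature, and `σ` is its standard deviation.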

Once standardization is complete, all the features will have a mean of zero, a standard deviation of one, and therefore, the same scale.

Unlike normalization, standardization does not have a bounding range. This means that an outlier does not force the rest of your data into a narrow slice of a fixed interval, although it still influences the mean and standard deviation used in the calculation. Therefore, if your dataset has outliers, standardization is often the preferred scaling technique.

Let’s see if you can use the formula above to implement standardization in Python:



```python
def standardize(lst, mean, std_dev):
    # Subtract the mean from each value, then divide by the standard deviation
    standardized = [(i - mean) / std_dev for i in lst]
    return standardized

# Test the function (expected results shown to three decimal places):
print(standardize([1, 2, 3, 4, 5], 3.0, 1.41))
# prints approximately [-1.418, -0.709, 0.0, 0.709, 1.418]
print(standardize([10, 15, 20], 15.0, 4.08))
# prints approximately [-1.225, 0.0, 1.225]
```
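
The function above takes the mean and standard deviation as arguments. As a variation, the sketch below computes both from the data using Python’s built-in `statistics` module, then checks that the result has a mean of about 0 and a standard deviation of about 1. The name `standardize_from_data` is hypothetical, and using the population standard deviation (`pstdev`) is an assumption made here.

```python
import statistics

def standardize_from_data(lst):
    # Compute the mean and the population standard deviation from the data
    mean = statistics.fmean(lst)
    std_dev = statistics.pstdev(lst)
    return [(x - mean) / std_dev for x in lst]

ages = [24, 40, 28, 22, 56]
z_scores = standardize_from_data(ages)

print([round(z, 3) for z in z_scores])
print(statistics.fmean(z_scores))   # very close to 0
print(statistics.pstdev(z_scores))  # very close to 1
```

Because of floating-point rounding, the printed mean and standard deviation may differ from 0 and 1 by a tiny amount.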


***
### When to Normalize vs. Standardize?

Min-max normalization and standardization both have a similar goal of transforming features in data to have the same scale so that each feature is equally important. So when should you use min-max normalization vs. standardization?

There is not always a clear answer. Both normalization and standardization have their strengths as well as their drawbacks. For example, if you need your data to be on a 0-1 scale, then it makes sense to use min-max normalization. If you have outliers in your data, standardization (Z-score normalization) is usually the better choice, since it does not squeeze the remaining values into a narrow part of a fixed range the way min-max normalization does.

Keep in mind that not every data set requires normalization or standardization. If your data features do not have vastly different ranges, then scaling your data might not be necessary.

### Python Implementation

As you saw, it is possible to implement min-max normalization and standardization by writing your own Python functions. However, in practice, most data analysts and scientists use popular libraries such as scikit-learn, which makes it very easy to scale your data.

For example, to normalize your data you can import **`MinMaxScaler`** from the **`sklearn.preprocessing`** package and then make a simple function call:

```python
from sklearn.preprocessing import MinMaxScaler
import pandas as pd
 
# read in data 
data = pd.read_csv('data.csv')
 
# normalize data 
scaler = MinMaxScaler()
normalized_data = scaler.fit_transform(data)
```

Likewise, standardizing your data is easy to do in just a few lines of code:

```python
from sklearn.preprocessing import StandardScaler
import pandas as pd
 
# read in data 
data = pd.read_csv('data.csv')
 
# standardize data
scaler = StandardScaler()
standardized_data = scaler.fit_transform(data)
```
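
Since the snippets above depend on a `data.csv` file, here is a self-contained sketch that applies both scalers to a small in-memory DataFrame. The ages reuse the earlier example, while the `income` values are hypothetical. Note that `fit_transform` returns a NumPy array, so the sketch rebuilds a DataFrame to keep the column names.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Ages reuse the earlier example; the incomes are hypothetical values
data = pd.DataFrame({
    "age": [24, 40, 28, 22, 56],
    "income": [30000, 52000, 41000, 28000, 90000],
})

# fit_transform returns a NumPy array, so wrap it to keep the column names
normalized_df = pd.DataFrame(
    MinMaxScaler().fit_transform(data), columns=data.columns
)
standardized_df = pd.DataFrame(
    StandardScaler().fit_transform(data), columns=data.columns
)

print(normalized_df.round(3))
print(standardized_df.round(3))
```

Each scaler works column by column, so `age` and `income` are scaled independently of each other.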

Later on, you’ll get practice with transforming your data in Python using real-world data.

## Review

As you learned in this article, normalization and standardization are slightly different scaling techniques with a similar goal: to put data features on the same scale so that no feature dominates the others. These techniques are widely used by data analysts and scientists.

In this article, you learned:

### Data centering:

* Data centering involves subtracting the mean of a data set from each data point so that the new mean is 0.
* Centered data is useful because it tells us how far above or below the mean each data point is.


### Min-max normalization:

* The goal of normalization is to put features with different ranges onto the same scale.
* For every feature in a data set, the minimum value of that feature gets transformed into a 0, the maximum value gets transformed into a 1, and every other value gets transformed into a decimal between 0 and 1.
* A downside of normalization is that it does not handle outliers well.

Formula:

$$
X_{norm} = \frac{X - X_{min}}{X_{max} - X_{min}}
$$





### Standardization:

* Standardization, also known as Z-score normalization, involves subtracting the mean `μ` of a feature from each observation `X` and then dividing by the feature’s standard deviation `σ`:

$$
z = \frac{X - \mu}{\sigma}
$$

* Once standardization is complete, all standardized features will have a mean of zero, a standard deviation of one, and therefore, the same scale.
* Unlike normalization, standardization does not have a bounding range, so it generally copes better with outliers.


As a data analyst or scientist, it’s important to understand the data you’re working with and how to transform that data. You now have a better grasp of common data transformation techniques and why it’s important to transform your data in the first place.