http://hamelg.blogspot.com/2015/11/python-for-data-analysis-part-16.html

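In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
# The original import, `from ggplot import mtcars`, fails on current installs with
# "ModuleNotFoundError: No module named 'ggplot'".
# Assumption: plotnine is installed; its data module ships the same mtcars table.
from plotnine.data import mtcars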

**Centering and Scaling**

Numeric variables are often on different scales and cover different ranges, so they can't be easily compared. What's more, variables with large values can dominate those with smaller values when using certain modeling techniques. Centering and scaling is a common preprocessing task that puts numeric variables on a common scale so no single variable will dominate the others.
The simplest way to center data is to subtract the mean value from each data point. This shifts the variable so that its new mean is zero.

In [None]:
mtcars.head()

In [None]:
mtcars.index = mtcars.name       # Set row index to car name
del mtcars["name"]               # Drop car name column

colmeans = mtcars.sum()/mtcars.shape[0]  # Get column means

colmeans

**Subtracting Column Means**

With the column means in hand, we just need to subtract them from each row to zero-center the data. When you subtract a Series from a DataFrame, pandas matches the Series index to the DataFrame's column labels and repeats the operation for every row by default, so we can simply subtract our column means Series from the data set to center it:

In [None]:
centered = mtcars-colmeans
centered.describe()
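
To see the label alignment at work, here is a minimal sketch on a made-up three-row table. The `toy` frame and its values are hypothetical and are not taken from mtcars. It suggests that the plain `-` operator and the more explicit `DataFrame.sub(..., axis="columns")` give the same centered result:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, used only to illustrate the mechanics
toy = pd.DataFrame({"a": [1.0, 2.0, 3.0],
                    "b": [10.0, 20.0, 60.0]})

toy_means = toy.mean()                 # Series indexed by column name

# The Series index is matched to the DataFrame's column labels,
# then the subtraction is repeated for every row
toy_centered = toy - toy_means

# Same operation with the alignment axis spelled out
toy_centered_explicit = toy.sub(toy_means, axis="columns")

print(toy_centered)
print(toy_centered.mean())             # each column mean is now 0
```

Spelling out the axis is optional here, but it can make the intent clearer when a Series could plausibly line up with either rows or columns.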

**Scaling using Standard Deviation**

With zero-centered data, negative values are below average and positive values are above average.
Now that the data is centered, we'd like to put it all on a common scale. One way to put data on a common scale is to divide by the standard deviation. Standard deviation is a statistic that describes the spread of numeric data. The higher the standard deviation, the further the data points tend to be spread away from the mean value. You can get standard deviations with df.std():

In [None]:
column_deviations = mtcars.std(axis=0)   # Get column standard deviations

centered_and_scaled = centered/column_deviations 

centered_and_scaled.describe()

**Carefully take a look at the results above**

Notice that after dividing by the standard deviation, every variable now has a standard deviation of 1. At this point, all the columns have a mean of approximately zero and the same scale of spread about the mean.
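
Centering followed by scaling is the usual z-score, z = (x - mean) / std. The sketch below applies both steps in one expression to a made-up table (the `toy` frame and its column names are hypothetical) and checks the resulting means and standard deviations:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data on very different scales
toy = pd.DataFrame({"small": [1.0, 2.0, 3.0, 4.0],
                    "large": [100.0, 300.0, 500.0, 900.0]})

# Center, then scale, in one expression
z = (toy - toy.mean()) / toy.std()     # DataFrame.std() defaults to ddof=1

print(z.mean())   # approximately 0 for both columns
print(z.std())    # 1 for both columns, up to floating-point error
```

After the transformation the two columns are directly comparable even though their raw values differ by two orders of magnitude.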

**Let the Machine Do It**

Now that you have suffered through manually centering and scaling the data, let's take a look at performing common data preprocessing automatically using functions built into Python libraries. The Python library **scikit-learn**, a popular package for predictive modeling and data analysis, has preprocessing tools including a scale() function for centering and scaling data:

In [None]:
from sklearn import preprocessing as prep

In [None]:
scaled_data = prep.scale(mtcars)  # Center and scale every column

# Note: preprocessing.scale() returns an ndarray, so we have to convert it back into a DataFrame.
scaled_cars = pd.DataFrame(scaled_data,    # Remake the DataFrame
                           index=mtcars.index,
                           columns=mtcars.columns)

print(scaled_cars.describe())

**Carefully take a look at the results above**

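Notice that the values are almost the same as those we calculated manually but not exactly the same. The main reason is the standard deviation formula rather than rounding: df.std() divides by n - 1 (the sample standard deviation), while scikit-learn's scale() divides by n (the population standard deviation). The scikit-learn values are therefore larger in magnitude by a constant factor of sqrt(n / (n - 1)), which is about 1.6% for the 32 rows of mtcars.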
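
One implementation detail that accounts for a gap of this kind is the denominator used for the standard deviation. The sketch below uses a made-up column (`x` is hypothetical) and NumPy's default `np.std`, which divides by n, as a stand-in for the population formula. It does not call scikit-learn itself:

```python
import numpy as np
import pandas as pd

# Hypothetical toy column, used only to illustrate the two denominators
x = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
n = len(x)

sample_std = x.std()                      # pandas default: ddof=1, divides by n - 1
population_std = np.std(x.to_numpy())     # NumPy default: ddof=0, divides by n

# Scale with the population standard deviation
z_population = (x - x.mean()) / population_std

print(sample_std / population_std)        # ratio of the two standard deviations
print(np.sqrt(n / (n - 1)))               # the same constant factor
print(z_population.std())                 # pandas reports slightly more than 1
```

Because describe() reports the sample standard deviation, columns scaled with the population formula show a standard deviation slightly above 1 rather than exactly 1.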