http://hamelg.blogspot.com/2015/11/python-for-data-analysis-part-16.html

In [4]:
%matplotlib inline
import numpy as np              
import pandas as pd
from ggplot import mtcars

**Centering and Scaling**

Numeric variables are often on different scales and cover different ranges, so they can't be easily compared. What's more, variables with large values can dominate those with smaller values when using certain modeling techniques. Centering and scaling is a common preprocessing task that puts numeric variables on a common scale so no single variable will dominate the others.
The simplest way to center data is to subtract the mean value from each data point. Subtracting the mean centers the data around zero and sets the new mean to zero. 
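As a minimal sketch of the idea, the snippet below centers a single column. The values are hypothetical toy numbers chosen so the arithmetic is easy to check by hand; they are not taken from mtcars.

```python
import pandas as pd

# Hypothetical toy values (not from mtcars); their mean is 5.0
values = pd.Series([2.0, 4.0, 6.0, 8.0])

# Subtract the mean from every data point
centered_values = values - values.mean()

print(centered_values.tolist())   # [-3.0, -1.0, 1.0, 3.0]
print(centered_values.mean())     # 0.0
```

Points below the original mean become negative and points above it become positive, while the distances between points are unchanged.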

In [5]:
mtcars.head()

,name,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
0,Mazda RX4,21.0,6,160.0,110,3.9,2.62,16.46,0,1,4,4
1,Mazda RX4 Wag,21.0,6,160.0,110,3.9,2.875,17.02,0,1,4,4
2,Datsun 710,22.8,4,108.0,93,3.85,2.32,18.61,1,1,4,1
3,Hornet 4 Drive,21.4,6,258.0,110,3.08,3.215,19.44,1,0,3,1
4,Hornet Sportabout,18.7,8,360.0,175,3.15,3.44,17.02,0,0,3,2


In [6]:
mtcars.index = mtcars.name       # Set row index to car name
del mtcars["name"]               # Drop car name column

colmeans = mtcars.sum()/mtcars.shape[0]  # Get column means

colmeans

mpg      20.090625
cyl       6.187500
disp    230.721875
hp      146.687500
drat      3.596563
wt        3.217250
qsec     17.848750
vs        0.437500
am        0.406250
gear      3.687500
carb      2.812500
dtype: float64
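The sum-divided-by-row-count calculation above is what `DataFrame.mean()` computes when no values are missing. The sketch below checks that on a hypothetical toy DataFrame (`toy` and its columns `a` and `b` are made up for illustration).

```python
import numpy as np
import pandas as pd

# Hypothetical toy DataFrame standing in for mtcars
toy = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 60.0]})

manual_means = toy.sum() / toy.shape[0]   # Column sums divided by the row count
builtin_means = toy.mean()                # Built-in column means

# The two approaches agree when there are no missing values
print(np.allclose(manual_means, builtin_means))
```

The two can differ when a column contains NaN: `sum()` skips missing values but `shape[0]` still counts those rows, whereas `mean()` skips them in both the sum and the count.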

**Subtracting Column Means**

With the column means in hand, we just need to subtract them from each row to zero-center the data. When a Series is subtracted from a DataFrame, pandas matches the Series index against the DataFrame's column labels and applies the subtraction to every row, so we can simply subtract our column means Series from the data set to center it:

In [7]:
centered = mtcars-colmeans
centered.describe()

,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
count,32.0,32.0,32.0,32.0,32.0,32.0,32.0,32.0,32.0,32.0,32.0
mean,3.996803e-15,0.0,-3.907985e-14,0.0,-5.967449e-16,4.787837e-16,-2.609024e-15,0.0,0.0,0.0,0.0
std,6.026948,1.785922,123.9387,68.562868,0.5346787,0.9784574,1.786943,0.504016,0.498991,0.737804,1.6152
min,-9.690625,-2.1875,-159.6219,-94.6875,-0.8365625,-1.70425,-3.34875,-0.4375,-0.40625,-0.6875,-1.8125
25%,-4.665625,-2.1875,-109.8969,-50.1875,-0.5165625,-0.636,-0.95625,-0.4375,-0.40625,-0.6875,-0.8125
50%,-0.890625,-0.1875,-34.42188,-23.6875,0.0984375,0.10775,-0.13875,-0.4375,-0.40625,0.3125,-0.8125
75%,2.709375,1.8125,95.27812,33.3125,0.3234375,0.39275,1.05125,0.5625,0.59375,0.3125,1.1875
max,13.80938,1.8125,241.2781,188.3125,1.333437,2.20675,5.05125,0.5625,0.59375,1.3125,5.1875
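The subtraction works because pandas lines up the Series with the DataFrame by column label, not by position. Here is a small sketch of that behavior on hypothetical toy data, with the means deliberately listed in a different order than the columns.

```python
import pandas as pd

# Hypothetical toy DataFrame standing in for mtcars
toy = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 60.0]})

# Column means, deliberately given in a different order than the columns
means = pd.Series({"b": 30.0, "a": 2.0})

# The Series index is matched to the column labels, then applied to each row
centered_toy = toy - means

# Equivalent explicit form of the same operation
centered_explicit = toy.sub(means, axis="columns")

print(centered_toy)
```

Each column is still centered by its own mean even though the order differs. The explicit `sub()` form can be easier to read when the direction of the operation matters.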


**Scaling Using Standard Deviation**

With zero-centered data, negative values are below average and positive values are above average.
Now that the data is centered, we'd like to put it all on a common scale. One way to put data on a common scale is to divide by the standard deviation. Standard deviation is a statistic that describes the spread of numeric data. The higher the standard deviation, the further the data points tend to be spread away from the mean value. You can get standard deviations with df.std():

In [8]:
column_deviations = mtcars.std(axis=0)   # Get column standard deviations

centered_and_scaled = centered/column_deviations 

centered_and_scaled.describe()

,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
count,32.0,32.0,32.0,32.0,32.0,32.0,32.0,32.0,32.0,32.0,32.0
mean,6.678685e-16,-6.938894e-18,-2.94903e-16,-2.4286130000000003e-17,-1.113692e-15,4.909267e-16,-1.465841e-15,1.387779e-17,8.326673e-17,-5.0306980000000006e-17,1.387779e-17
std,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
min,-1.607883,-1.224858,-1.28791,-1.381032,-1.564608,-1.741772,-1.87401,-0.8680278,-0.8141431,-0.9318192,-1.122152
25%,-0.7741273,-1.224858,-0.8867035,-0.7319924,-0.9661175,-0.6500027,-0.5351317,-0.8680278,-0.8141431,-0.9318192,-0.5030337
50%,-0.1477738,-0.1049878,-0.2777331,-0.3454858,0.1841059,0.1101223,-0.07764656,-0.8680278,-0.8141431,0.4235542,-0.5030337
75%,0.4495434,1.014882,0.7687521,0.4858679,0.6049193,0.4013971,0.5882951,1.116036,1.189901,0.4235542,0.7352031
max,2.291272,1.014882,1.946754,2.746567,2.493904,2.255336,2.826755,1.116036,1.189901,1.778928,3.211677
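The two steps can also be combined into one expression. The sketch below wraps them in a helper called `standardize`, a hypothetical name introduced here for illustration, and applies it to a toy DataFrame, not mtcars.

```python
import numpy as np
import pandas as pd

def standardize(df):
    """Center each column on its mean, then divide by its standard deviation.

    Hypothetical helper for illustration. It uses the pandas default
    df.std(), which is the sample standard deviation (ddof=1).
    """
    return (df - df.mean()) / df.std()

# Hypothetical toy DataFrame standing in for mtcars
toy = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 60.0]})

z_scores = standardize(toy)

print(z_scores.mean())   # Each column mean is effectively 0
print(z_scores.std())    # Each column standard deviation is effectively 1
```

A column with zero variance would produce a division by zero here, so constant columns need to be dropped or handled separately before scaling this way.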


**Carefully take a look at the results above**

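Notice that after dividing by the standard deviation, every variable now has a standard deviation of 1. The means are all effectively 0 (values on the order of 1e-15 or smaller are floating-point round-off), so all the columns now share the same center and the same scale of spread about the mean.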

**Let the Machine Do It**

Now that you have suffered through manually centering and scaling the data, let's take a look at performing common data preprocessing automatically using functions built into Python libraries. The Python library **scikit-learn**, a popular package for predictive modeling and data analysis, has preprocessing tools, including a scale() function for centering and scaling data:

In [9]:
from sklearn import preprocessing as prep

In [11]:
scaled_data = prep.scale(mtcars)  # Center and scale each column

# Note: preprocessing.scale() returns a NumPy ndarray, so we have to convert it back into a DataFrame.
scaled_cars = pd.DataFrame(scaled_data,    # Remake the DataFrame
                           index=mtcars.index,
                           columns=mtcars.columns)

print(scaled_cars.describe())

                mpg           cyl          disp            hp          drat  \
count  3.200000e+01  3.200000e+01  3.200000e+01  3.200000e+01  3.200000e+01   
mean  -5.481726e-16  4.163336e-17  1.387779e-16 -1.734723e-17 -3.122502e-16   
std    1.016001e+00  1.016001e+00  1.016001e+00  1.016001e+00  1.016001e+00   
min   -1.633610e+00 -1.244457e+00 -1.308518e+00 -1.403130e+00 -1.589643e+00   
25%   -7.865141e-01 -1.244457e+00 -9.008917e-01 -7.437050e-01 -9.815764e-01   
50%   -1.501383e-01 -1.066677e-01 -2.821771e-01 -3.510140e-01  1.870518e-01   
75%    4.567366e-01  1.031121e+00  7.810529e-01  4.936423e-01  6.145986e-01   
max    2.327934e+00  1.031121e+00  1.977904e+00  2.790515e+00  2.533809e+00   

                 wt          qsec            vs            am          gear  \
count  3.200000e+01  3.200000e+01  3.200000e+01  3.200000e+01  3.200000e+01   
mean   4.683753e-17 -1.469311e-15 -6.938894e-18  5.551115e-17 -1.144917e-16   
std    1.016001e+00  1.016001e+00  1.016001e+00  1.

**Carefully take a look at the results above**

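Notice that the values are close to those we calculated manually but not exactly the same. The difference is systematic rather than a rounding effect: pandas' df.std() computes the sample standard deviation (ddof=1, dividing by n-1), while scikit-learn's scale() divides by the population standard deviation (ddof=0, dividing by n). With 32 rows, every scikit-learn value is therefore larger by a factor of sqrt(32/31) ≈ 1.016, which is why the std row above shows 1.016001 instead of 1.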
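The size of the gap can be checked from the two standard deviation conventions alone. The sketch below uses randomly generated stand-in data with 32 rows, not mtcars. It uses NumPy's default `np.std` (ddof=0), which the scikit-learn documentation describes as equivalent to the estimator used by `scale()`, and compares it with the pandas default (ddof=1).

```python
import numpy as np
import pandas as pd

n = 32                                    # Same number of rows as mtcars
rng = np.random.default_rng(0)            # Seeded for repeatability
column = pd.Series(rng.normal(size=n))    # Hypothetical stand-in column

sample_sd = column.std()                     # pandas default: ddof=1 (divide by n-1)
population_sd = np.std(column.to_numpy())    # NumPy default: ddof=0 (divide by n)

# The ratio depends only on n, not on the data
ratio = sample_sd / population_sd
expected_ratio = float(np.sqrt(n / (n - 1)))

print(np.isclose(ratio, expected_ratio))
print(round(expected_ratio, 6))   # 1.016001
```

To make the manual calculation use the population convention, pass `ddof=0` to the pandas call, for example `mtcars.std(axis=0, ddof=0)`.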