Lesson 6: Prepare For Modeling by Pre-Processing Data

Your raw data may not be in the best shape for modeling.

Sometimes you need to preprocess your data so that it best presents the inherent structure of the problem to the modeling algorithms. In today’s lesson, you will use the preprocessing capabilities provided by scikit-learn.

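The scikit-learn library provides two standard idioms for transforming data, each useful in different circumstances: Fit and Multiple Transform, where you call fit() once and then transform() on as many datasets as you need, and Combined Fit-And-Transform, where a single fit_transform() call does both steps on the same data.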
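The two idioms can be sketched side by side as follows. This is a minimal illustration, not part of the lesson's own notebook: the small `train` and `new` arrays are made-up values chosen only to keep the example short.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data, invented for illustration only.
train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
new = np.array([[4.0, 40.0]])

# Idiom 1: Fit and Multiple Transform.
# fit() learns the per-column mean and standard deviation once;
# transform() can then be applied to any number of datasets.
scaler = StandardScaler().fit(train)
train_scaled = scaler.transform(train)
new_scaled = scaler.transform(new)

# Idiom 2: Combined Fit-And-Transform.
# fit_transform() learns the parameters and rescales the same data in one call.
train_scaled_combined = StandardScaler().fit_transform(train)
```

The first idiom is usually the one to reach for when the parameters learned on training data must be reused on later data; the second is a shortcut when only one dataset is involved.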

There are many techniques that you can use to prepare your data for modeling. For example, try out some of the following:

    Standardize numerical data (e.g. mean of 0 and standard deviation of 1) using the StandardScaler class.
    Normalize numerical data (e.g. to a range of 0-1) using the MinMaxScaler class.
    Explore more advanced feature engineering such as binarizing with the Binarizer class.
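As a rough sketch of the normalizing and binarizing options listed above, the snippet below applies both to a tiny array. The `X_toy` values and the threshold of 2.0 are arbitrary assumptions picked for illustration; they do not come from the diabetes dataset.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, Binarizer

# Toy data, invented for illustration only.
X_toy = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]])

# Normalize each column to the range 0-1.
normalized = MinMaxScaler(feature_range=(0, 1)).fit_transform(X_toy)

# Binarize: values strictly greater than the threshold become 1, the rest 0.
# The threshold here is an arbitrary choice for the example.
binary = Binarizer(threshold=2.0).fit_transform(X_toy)
```

Note that MinMaxScaler works column by column, while Binarizer applies the same threshold to every value in the array.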

For example, the snippet below loads the Pima Indians onset of diabetes dataset, calculates the parameters needed to standardize the data, then creates a standardized copy of the input data.

In [1]:
from sklearn.preprocessing import StandardScaler
import pandas as pd
import numpy as np
import feather

In [2]:
#url="https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"

#names=['preg','plas','pres','skin','test','mass','pedi','age','class']

#data=pd.read_csv(url,names=names)
data = feather.read_dataframe('diabetes')

In [12]:
array = data.values
X = array[:, 0:8]  # first eight columns: input features
Y = array[:, 8]    # last column: class label

In [13]:
# fit() calculates each column's mean and standard deviation; transform() applies them
scaler = StandardScaler().fit(X)
rescaledX = scaler.transform(X)

In [14]:
np.set_printoptions(precision=4)

In [15]:
rescaledX[0:5,:]

array([[ 0.6399,  0.8483,  0.1496,  0.9073, -0.6929,  0.204 ,  0.4685,
         1.426 ],
       [-0.8449, -1.1234, -0.1605,  0.5309, -0.6929, -0.6844, -0.3651,
        -0.1907],
       [ 1.2339,  1.9437, -0.2639, -1.2882, -0.6929, -1.1033,  0.6044,
        -0.1056],
       [-0.8449, -0.9982, -0.1605,  0.1545,  0.1233, -0.494 , -0.9208,
        -1.0415],
       [-1.1419,  0.5041, -1.5047,  0.9073,  0.7658,  1.4097,  5.4849,
        -0.0205]])

In [11]:
X.shape

(768, 8)
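One way to sanity-check a standardized array is to confirm that every column ends up with a mean close to 0 and a standard deviation close to 1. The sketch below does this on randomly generated stand-in data with the same shape as X; the random values are an assumption used only so the example runs without the dataset file.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Stand-in data with the same shape as X (768 rows, 8 columns).
# These random values are NOT the diabetes data.
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(768, 8))

scaler_demo = StandardScaler().fit(X_demo)
rescaled_demo = scaler_demo.transform(X_demo)

# Parameters learned by fit(): one mean and one scale value per column.
learned_means = scaler_demo.mean_
learned_scales = scaler_demo.scale_

# After standardizing, each column should have mean ~0 and std ~1.
column_means = rescaled_demo.mean(axis=0)
column_stds = rescaled_demo.std(axis=0)
```

The same two checks can be run on rescaledX directly once the dataset is loaded.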