In this chapter, we will cover numerous strategies for transforming raw numerical data into features purpose-built for machine learning algorithms.

# Rescaling a Feature

You need to rescale the values of a numerical feature to lie between two values.

Use scikit-learn's 'MinMaxScaler' to rescale a feature array:

In [9]:
#load library
import numpy as np
from sklearn import preprocessing

#Create an example feature
feature = np.array([[-500.5],[-100.1],[0],[100.1],[900.9]])

#Create scaler
minmax_scale = preprocessing.MinMaxScaler(feature_range=(0,1))

#scale feature
scale_feature = minmax_scale.fit_transform(feature)

#Show scaled feature
scale_feature

array([[0.        ],
       [0.28571429],
       [0.35714286],
       [0.42857143],
       [1.        ]])
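The scaler's arithmetic can be checked by hand. As a minimal sketch, assuming the default `feature_range=(0, 1)`, min-max scaling maps each value x to (x - min) / (max - min). The names `manual_scaled` and `sklearn_scaled` are introduced here for illustration only.

```python
import numpy as np
from sklearn import preprocessing

# Same example feature as above
feature = np.array([[-500.5], [-100.1], [0], [100.1], [900.9]])

# Min-max scaling by hand: subtract the column minimum,
# then divide by the column range (max - min)
feature_min = feature.min(axis=0)
feature_max = feature.max(axis=0)
manual_scaled = (feature - feature_min) / (feature_max - feature_min)

# Compare with scikit-learn's result
sklearn_scaled = preprocessing.MinMaxScaler(feature_range=(0, 1)).fit_transform(feature)
print(np.allclose(manual_scaled, sklearn_scaled))
```

The comparison should print `True`, since both paths apply the same formula to the same column.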

# Standardizing a Feature

You want to transform a feature to have a mean of 0 and a standard deviation of 1.

scikit-learn's 'StandardScaler' performs both transformations:

In [10]:
#load library 
import numpy as np
from sklearn import preprocessing

#Create an example feature matrix
feature = np.array([[1,2,3],[6,7,10],[500.5,6,5],[400.2,5,0],[0,3,100.1]])

#Create scaler 
standard_scaler = preprocessing.StandardScaler()

#Transform the feature
standardized_feature = standard_scaler.fit_transform(feature)

#show feature
standardized_feature

array([[-0.81408043, -1.40182605, -0.53728211],
       [-0.79153472,  1.29399328, -0.3548876 ],
       [ 1.43823581,  0.75482941, -0.4851694 ],
       [ 0.98596891,  0.21566555, -0.61545119],
       [-0.81858957, -0.86266219,  1.99279031]])

In [11]:
#print mean and standard deviation
print("mean:",round(standardized_feature.mean()))
print("Standard deviation:",standardized_feature.std())

mean: 0.0
Standard deviation: 1.0
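The check above averages over the whole matrix, but standardization is applied to each column separately. The sketch below reproduces the z-score, (x - mean) / std, by hand and checks each column. It assumes the population standard deviation (NumPy's default, `ddof=0`), which is what 'StandardScaler' uses. The variable names are illustrative.

```python
import numpy as np
from sklearn import preprocessing

# Same example feature matrix as above
feature = np.array([[1, 2, 3], [6, 7, 10], [500.5, 6, 5], [400.2, 5, 0], [0, 3, 100.1]])

standardized = preprocessing.StandardScaler().fit_transform(feature)

# z-score by hand, column by column (axis=0)
manual = (feature - feature.mean(axis=0)) / feature.std(axis=0)

print(np.allclose(manual, standardized))             # manual result matches the scaler
print(np.allclose(standardized.mean(axis=0), 0.0))   # every column has mean 0
print(np.allclose(standardized.std(axis=0), 1.0))    # every column has std 1
```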


If our data has significant outliers, they can distort the standardization by skewing the feature's mean and variance. In this scenario, it is often helpful to rescale the feature using the median and interquartile range instead. In scikit-learn, we do this using the 'RobustScaler' class:

In [12]:
#Create scaler
robust_scaler = preprocessing.RobustScaler()

#Transform feature 
robust_scaler.fit_transform(feature)

array([[-1.25250501e-02, -1.00000000e+00, -2.85714286e-01],
       [ 0.00000000e+00,  6.66666667e-01,  7.14285714e-01],
       [ 1.23872745e+00,  3.33333333e-01,  0.00000000e+00],
       [ 9.87474950e-01,  0.00000000e+00, -7.14285714e-01],
       [-1.50300601e-02, -6.66666667e-01,  1.35857143e+01]])
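To make the median and interquartile-range scaling concrete, here is a minimal sketch that computes (x - median) / (Q3 - Q1) per column with NumPy and compares it to 'RobustScaler'. It assumes the scaler's default quantile range of the 25th to 75th percentile. The variable names are illustrative.

```python
import numpy as np
from sklearn import preprocessing

# Same example feature matrix as above
feature = np.array([[1, 2, 3], [6, 7, 10], [500.5, 6, 5], [400.2, 5, 0], [0, 3, 100.1]])

# Column-wise median and quartiles
median = np.median(feature, axis=0)
q1, q3 = np.percentile(feature, [25, 75], axis=0)

# Center on the median and divide by the interquartile range
manual_robust = (feature - median) / (q3 - q1)

sklearn_robust = preprocessing.RobustScaler().fit_transform(feature)
print(np.allclose(manual_robust, sklearn_robust))
```

Because the median and quartiles barely move when one value is extreme, the large entry in the third column (100.1) stays large after scaling instead of compressing the other values.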

# Normalizing Observations

You want to rescale the feature values of each observation to have unit norm (a total length of 1).

In [13]:
#load libraries 
import numpy as np
from sklearn.preprocessing import Normalizer

#Create example matrix
feature = np.array([[0.5,0.5],[1.1,3.4],[1.5,20.2],[1.63,34.4],[10.9,3.3]])

#Create normalizer
normalizer = Normalizer(norm="l2")

#Transform feature
normalizer.transform(feature)

array([[0.70710678, 0.70710678],
       [0.30782029, 0.95144452],
       [0.07405353, 0.99725427],
       [0.04733062, 0.99887928],
       [0.95709822, 0.28976368]])

In [14]:
#Transform feature 
features_l2_norm = Normalizer(norm="l2").transform(feature)

features_l2_norm

array([[0.70710678, 0.70710678],
       [0.30782029, 0.95144452],
       [0.07405353, 0.99725427],
       [0.04733062, 0.99887928],
       [0.95709822, 0.28976368]])

In [15]:
#Transform feature 
features_l1_norm = Normalizer(norm="l1").transform(feature)

features_l1_norm

array([[0.5       , 0.5       ],
       [0.24444444, 0.75555556],
       [0.06912442, 0.93087558],
       [0.04524008, 0.95475992],
       [0.76760563, 0.23239437]])

In [16]:
#print sum
print("Sum of the first observation\'s values:",
     features_l1_norm[0,1]+features_l1_norm[0,0])

Sum of the first observation's values: 1.0
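The two norms differ only in what each row is divided by. As a sketch with illustrative variable names: the L2 option divides each observation by its Euclidean length, while the L1 option divides it by the sum of its absolute values.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Same example matrix as above
feature = np.array([[0.5, 0.5], [1.1, 3.4], [1.5, 20.2], [1.63, 34.4], [10.9, 3.3]])

# L2: divide each row by its Euclidean length
l2_lengths = np.linalg.norm(feature, ord=2, axis=1, keepdims=True)
manual_l2 = feature / l2_lengths

# L1: divide each row by the sum of its absolute values
l1_lengths = np.abs(feature).sum(axis=1, keepdims=True)
manual_l1 = feature / l1_lengths

print(np.allclose(manual_l2, Normalizer(norm="l2").fit_transform(feature)))
print(np.allclose(manual_l1, Normalizer(norm="l1").fit_transform(feature)))

# After L2 normalization every row has length 1
print(np.linalg.norm(manual_l2, axis=1))
```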


# Generating Polynomial and Interaction Features

Even though some choose to create polynomial and interaction features manually, scikit-learn offers a built-in method:

In [1]:
#load libraries 
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

#Create feature matrix
feature = np.array([[2,3],[4,5],[3,4]])

#Create PolynomialFeatures object
polynomial_interaction = PolynomialFeatures(degree=2,include_bias=False)

#Create polynomial feature
polynomial_interaction.fit_transform(feature)

array([[ 2.,  3.,  4.,  6.,  9.],
       [ 4.,  5., 16., 20., 25.],
       [ 3.,  4.,  9., 12., 16.]])

"degree" sets the maximum power of the generated terms: "degree=2" adds the second-power terms (x1^2, x1*x2, x2^2), while "degree=3" also adds the third-power terms:

In [2]:
polynomial_interaction3 = PolynomialFeatures(degree=3,include_bias=False)
polynomial_interaction3.fit_transform(feature)

array([[  2.,   3.,   4.,   6.,   9.,   8.,  12.,  18.,  27.],
       [  4.,   5.,  16.,  20.,  25.,  64.,  80., 100., 125.],
       [  3.,   4.,   9.,  12.,  16.,  27.,  36.,  48.,  64.]])

Setting "interaction_only=True" restricts the output to the original features and their products:

In [3]:
interaction = PolynomialFeatures(degree=2,interaction_only=True,include_bias=False)
interaction.fit_transform(feature)

array([[ 2.,  3.,  6.],
       [ 4.,  5., 20.],
       [ 3.,  4., 12.]])
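To see which column is which, the degree-2 and interaction-only outputs can be rebuilt by hand. This is a sketch for the two-feature case only. The names `x1` and `x2` and the column order (x1, x2, x1^2, x1*x2, x2^2) are inferred from the outputs shown above.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Same feature matrix as above
feature = np.array([[2, 3], [4, 5], [3, 4]])
x1 = feature[:, 0]
x2 = feature[:, 1]

# degree=2 without bias: x1, x2, x1^2, x1*x2, x2^2
manual_degree2 = np.column_stack([x1, x2, x1**2, x1 * x2, x2**2])

# interaction_only=True: x1, x2, x1*x2 (no squared terms)
manual_interaction = np.column_stack([x1, x2, x1 * x2])

sk_degree2 = PolynomialFeatures(degree=2, include_bias=False).fit_transform(feature)
sk_interaction = PolynomialFeatures(
    degree=2, interaction_only=True, include_bias=False
).fit_transform(feature)

print(np.allclose(manual_degree2, sk_degree2))
print(np.allclose(manual_interaction, sk_interaction))
```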

# Transforming Features

You want to apply a custom transformation to one or more features.

In [4]:
#load libraries
import numpy as np
from sklearn.preprocessing import FunctionTransformer

#Create feature matrix
feature = np.array([[2,3],[2,3],[2,3]])

#Define a simple function
def add_ten(x):
    return x+10

#Create transformer 
ten_transformer = FunctionTransformer(add_ten)

#Transform feature matrix
ten_transformer.transform(feature)



array([[12, 13],
       [12, 13],
       [12, 13]])

We can create the same transformation in pandas using "apply":

In [5]:
#load libraries 
import pandas as pd

#Create data frame
df = pd.DataFrame(feature,columns=["feature1","feature2"])
df.apply(add_ten)


   feature1  feature2
0        12        13
1        12        13
2        12        13
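The examples above transform every feature at once. To transform only one of them, a minimal pandas sketch is to call "apply" on a single column. The copy `df_one` is an illustrative name, used here so the original data frame is left unchanged.

```python
import numpy as np
import pandas as pd

# Same feature matrix and function as above
feature = np.array([[2, 3], [2, 3], [2, 3]])
df = pd.DataFrame(feature, columns=["feature1", "feature2"])

def add_ten(x):
    return x + 10

# Transform only feature1; feature2 is left as it was
df_one = df.copy()
df_one["feature1"] = df_one["feature1"].apply(add_ten)
print(df_one)
```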


# Detecting Outliers

A common method is to assume the data is normally distributed and, based on that assumption, "draw" an ellipse around the data, classifying any observation inside the ellipse as an inlier (labeled as 1) and any observation outside the ellipse as an outlier (labeled as -1):

In [14]:
#Load libraries 
import numpy as np
from sklearn.covariance import EllipticEnvelope
from sklearn.datasets import make_blobs

#Create simulated data
feature,_=make_blobs(n_samples=10,
                     n_features=2,
                     centers=1,
                     random_state=1)
feature

array([[-1.83198811,  3.52863145],
       [-2.76017908,  5.55121358],
       [-1.61734616,  4.98930508],
       [-0.52579046,  3.3065986 ],
       [ 0.08525186,  3.64528297],
       [-0.79415228,  2.10495117],
       [-1.34052081,  4.15711949],
       [-1.98197711,  4.02243551],
       [-2.18773166,  3.33352125],
       [-0.19745197,  2.34634916]])

In [15]:
#Replace the fifth observation's values (row index 4) with extreme values
feature[4,0]=1000
feature[4,1]=1000

#Create detector
outlier_detector = EllipticEnvelope(contamination = .1)

#fit detector
outlier_detector.fit(feature)

#predict outlier
outlier_detector.predict(feature)

array([ 1,  1,  1,  1, -1,  1,  1,  1,  1,  1])
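Once the labels are available, a typical next step is to use them as a mask. The sketch below rebuilds the same data, then separates inliers from outliers with boolean indexing. The `random_state=0` argument on the detector is an assumption added only to make the run repeatable, and the names `inlier_mask`, `inliers`, and `outlier_rows` are illustrative.

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope
from sklearn.datasets import make_blobs

# Recreate the simulated data and plant the same extreme observation
feature, _ = make_blobs(n_samples=10, n_features=2, centers=1, random_state=1)
feature[4, 0] = 1000
feature[4, 1] = 1000

# random_state is fixed here only for reproducibility (an assumption)
detector = EllipticEnvelope(contamination=0.1, random_state=0)
predictions = detector.fit(feature).predict(feature)

# predict() returns 1 for inliers and -1 for outliers
inlier_mask = predictions == 1
inliers = feature[inlier_mask]
outlier_rows = np.where(~inlier_mask)[0]

print(outlier_rows)    # row indices flagged as outliers
print(inliers.shape)   # remaining observations
```

Note that "contamination" is the proportion of observations the detector will flag, so with 10 samples and a value of 0.1 it labels one observation as an outlier whether or not the data actually contains one.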