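**Imputer** : Fills in missing values (NaN) using a statistic computed from each column of the training data, here the column mean.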

In [1]:
import numpy as np
from sklearn.preprocessing import Imputer


In [17]:
imp = Imputer(missing_values='NaN',strategy='mean',axis=0)
imp.fit([[1,2],[np.nan,3],[7,6]])

Imputer(axis=0, copy=True, missing_values='NaN', strategy='mean', verbose=0)

In [18]:
X = [[np.nan,2],[6,np.nan],[7,6],[13,np.nan],[np.nan,np.nan],[0,np.nan]]
imp.transform(X)

array([[  4.        ,   2.        ],
       [  6.        ,   3.66666667],
       [  7.        ,   6.        ],
       [ 13.        ,   3.66666667],
       [  4.        ,   3.66666667],
       [  0.        ,   3.66666667]])
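To make the arithmetic behind the output above explicit, here is a minimal pure-Python sketch of mean imputation. The helper names `column_means` and `impute` are hypothetical and not part of scikit-learn. Note that recent scikit-learn releases no longer ship `sklearn.preprocessing.Imputer`; the class covering this use case there is `sklearn.impute.SimpleImputer`.

```python
import math

def column_means(rows):
    """Mean of each column, ignoring NaN entries (the 'fit' step)."""
    means = []
    for col in zip(*rows):
        observed = [v for v in col if not math.isnan(v)]
        means.append(sum(observed) / len(observed))
    return means

def impute(rows, means):
    """Replace every NaN with the stored mean of its column (the 'transform' step)."""
    return [[means[j] if math.isnan(v) else float(v) for j, v in enumerate(row)]
            for row in rows]

nan = float("nan")
train = [[1, 2], [nan, 3], [7, 6]]
means = column_means(train)  # column 0: (1 + 7) / 2, column 1: (2 + 3 + 6) / 3

X = [[nan, 2], [6, nan], [7, 6], [13, nan], [nan, nan], [0, nan]]
filled = impute(X, means)
for row in filled:
    print(row)
```

The means come from the data passed to fit, not from `X`, which is why every missing entry in the second column becomes 3.66666667 regardless of the values in `X`.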

**Binarizer** :
Binarize data (set feature values to 0 or 1) according to a threshold.
Values greater than the threshold map to 1, while values less than or equal to the threshold map to 0. With the default threshold of 0, only positive values map to 1.

In [20]:
X = [[1,-1],[0,0],[-11,11],[0,4],[-1,-1]]

In [25]:
from sklearn.preprocessing import Binarizer

In [28]:
binarizer = Binarizer().fit(X)

In [30]:
binarizer.transform(X) # default threshold is 0

array([[1, 0],
       [0, 0],
       [0, 1],
       [0, 1],
       [0, 0]])

In [31]:
# Change the threshold to -5 (Binarizer is stateless, so no fit call is needed before transform)
binarizer = Binarizer(threshold=-5)

In [32]:
binarizer.transform(X)

array([[1, 1],
       [1, 1],
       [0, 1],
       [1, 1],
       [1, 1]])
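The thresholding rule can be sketched in a few lines of plain Python. The function name `binarize_rows` is hypothetical and only illustrates the comparison; it is not the library implementation.

```python
def binarize_rows(rows, threshold=0.0):
    """Map values strictly greater than the threshold to 1, everything else to 0."""
    return [[1 if v > threshold else 0 for v in row] for row in rows]

X = [[1, -1], [0, 0], [-11, 11], [0, 4], [-1, -1]]

default_result = binarize_rows(X)               # threshold 0: only positive values become 1
lowered_result = binarize_rows(X, threshold=-5)  # only -11 stays at 0

print(default_result)
print(lowered_result)
```

Because the comparison is strict, a value equal to the threshold (such as 0 with the default threshold) maps to 0.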

**LabelBinarizer** : Several regression and binary classification algorithms can be extended to multi-class classification with the one-vs-all scheme. At learning time, this consists of learning one regressor or binary classifier per class, which requires converting multi-class labels to binary labels (belongs or does not belong to the class). LabelBinarizer makes this easy with the transform method.
At prediction time, one assigns the class for which the corresponding model gave the greatest confidence. LabelBinarizer makes this easy with the inverse_transform method.

In [33]:
from sklearn.preprocessing import LabelBinarizer

In [34]:
lb = LabelBinarizer()

In [37]:
lb.fit([1,2,6,4,2,4,5,6,7,2,1])

LabelBinarizer(neg_label=0, pos_label=1, sparse_output=False)

In [38]:
lb.classes_

array([1, 2, 4, 5, 6, 7])

In [40]:
lb.transform([1,6,2])

array([[1, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 1, 0],
       [0, 1, 0, 0, 0, 0]], dtype=int32)

In [41]:
# Binary targets transform to a column vector

In [43]:
lb.fit_transform(['yes','no','yes','yes'])

array([[1],
       [0],
       [1],
       [1]], dtype=int32)

In [55]:
import numpy as np
lb.fit(np.array([[0, 1, 1], [1, 0, 0]]))

LabelBinarizer(neg_label=0, pos_label=1, sparse_output=False)

In [56]:
lb.classes_

array([0, 1, 2])

In [58]:
lb.transform([2,1])

array([[0, 0, 1],
       [0, 1, 0]], dtype=int32)
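The one-vs-all round trip can be sketched in plain Python. The helper names and the `scores` values below are hypothetical, made up for illustration only, and the sketch ignores the special case where binary targets collapse to a single column.

```python
def fit_classes(labels):
    """Sorted unique labels, analogous to classes_."""
    return sorted(set(labels))

def binarize_labels(labels, classes):
    """One row per label, with a 1 in the column of the matching class."""
    return [[1 if label == c else 0 for c in classes] for label in labels]

def most_confident_class(scores, classes):
    """Pick, for each row of per-class scores, the class with the highest score."""
    return [classes[max(range(len(row)), key=row.__getitem__)] for row in scores]

classes = fit_classes([1, 2, 6, 4, 2, 4, 5, 6, 7, 2, 1])
Y = binarize_labels([1, 6, 2], classes)

# Hypothetical confidences from six per-class models for two samples.
scores = [
    [0.10, 0.20, 0.00, 0.05, 0.60, 0.05],
    [0.70, 0.10, 0.10, 0.05, 0.03, 0.02],
]
predicted = most_confident_class(scores, classes)

print(classes)
print(Y)
print(predicted)
```

Each column of `Y` is the binary target for one per-class model, and taking the arg-max over the per-class scores is the step that maps predictions back to the original labels.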

**LabelEncoder** : Encodes labels as integers between 0 and n_classes-1. This can be used to normalize the labels.

In [2]:
from sklearn.preprocessing import LabelEncoder


In [3]:
le = LabelEncoder()

In [4]:
le.fit([1,2,2,6])

LabelEncoder()

In [5]:
le.classes_

array([1, 2, 6])

In [6]:
le.transform([1,6,2,1,6,6,1,1,2,2,2])

array([0, 2, 1, 0, 2, 2, 0, 0, 1, 1, 1], dtype=int32)

In [7]:
le.inverse_transform([0, 2, 1, 0, 2, 2, 0, 0, 1, 1, 1])

array([1, 6, 2, 1, 6, 6, 1, 1, 2, 2, 2])

In [8]:
# This can be used to transform non-numerical labels to numerical labels.

In [9]:
le.fit(['paris','mumbai','paris','London','mumbai'])

LabelEncoder()

In [10]:
le.classes_

array(['London', 'mumbai', 'paris'],
      dtype='<U6')

In [11]:

le.transform(['paris','paris','paris','paris','paris','London','mumbai','paris'])

array([2, 2, 2, 2, 2, 0, 1, 2], dtype=int32)
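The encoding amounts to a lookup table built from the sorted unique labels. A minimal pure-Python sketch, with hypothetical helper names that are not part of scikit-learn:

```python
def fit_labels(labels):
    """Sorted unique labels; the position in this list is the integer code."""
    return sorted(set(labels))

def encode(labels, classes):
    """Map each label to its index in classes (raises KeyError for unseen labels)."""
    index = {c: i for i, c in enumerate(classes)}
    return [index[label] for label in labels]

def decode(codes, classes):
    """Map integer codes back to the original labels."""
    return [classes[i] for i in codes]

classes = fit_labels(['paris', 'mumbai', 'paris', 'London', 'mumbai'])
codes = encode(['paris', 'paris', 'paris', 'paris', 'paris', 'London', 'mumbai', 'paris'], classes)
restored = decode(codes, classes)

print(classes)
print(codes)
```

Strings are ordered by character code, so 'London' (capital L) sorts before 'mumbai' and 'paris' and receives code 0.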

**Normalization** : Normalization is the process of scaling individual samples (rows) to have unit norm.

In [15]:
from sklearn.preprocessing import Normalizer,normalize

In [21]:
X = [[1,-1,2],[2,0,0],[0,1,-1]]
# Normalizer is stateless (fit learns nothing); the class form is helpful in a Pipeline
nor = Normalizer()
nor.fit(X)

Normalizer(copy=True, norm='l2')

In [17]:
nor.transform([[1,1,1],[0,0,0]])

array([[ 0.57735027,  0.57735027,  0.57735027],
       [ 0.        ,  0.        ,  0.        ]])

In [22]:
X_nor = normalize(X,norm='l2')
X_nor

array([[ 0.40824829, -0.40824829,  0.81649658],
       [ 1.        ,  0.        ,  0.        ],
       [ 0.        ,  0.70710678, -0.70710678]])
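The l2 normalization above divides each row by its Euclidean length. A minimal pure-Python sketch follows; the function name `l2_normalize_rows` is hypothetical, and the handling of all-zero rows (left unchanged) is chosen to match the output shown earlier for `[0,0,0]`.

```python
import math

def l2_normalize_rows(rows):
    """Scale each row to unit Euclidean norm; all-zero rows are left as zeros."""
    normalized = []
    for row in rows:
        norm = math.sqrt(sum(v * v for v in row))
        if norm == 0:
            norm = 1.0  # avoid division by zero for an all-zero row
        normalized.append([v / norm for v in row])
    return normalized

X = [[1, -1, 2], [2, 0, 0], [0, 1, -1]]
X_unit = l2_normalize_rows(X)
zero_row = l2_normalize_rows([[0, 0, 0]])

for row in X_unit:
    print(row)
```

For the first row the norm is sqrt(1 + 1 + 4) = sqrt(6), so the entries become 1/sqrt(6), -1/sqrt(6) and 2/sqrt(6).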

**OneHotEncoder** : Encode categorical integer features using a one-hot (also known as one-of-K) scheme. The input to this transformer should be a matrix of integers denoting the values taken on by categorical (discrete) features.

Note: a one-hot encoding of y labels should use a LabelBinarizer instead.

In [23]:
from sklearn.preprocessing import OneHotEncoder

In [24]:
en = OneHotEncoder()

In [25]:
en.fit([[0,0,3],[1,1,0],[0,2,1],[1,0,2]])

OneHotEncoder(categorical_features='all', dtype=<class 'numpy.float64'>,
       handle_unknown='error', n_values='auto', sparse=True)

In [29]:
en.n_values_
# number of values per feature (column), i.e. the largest value seen in that column + 1

array([2, 3, 4])

In [28]:
en.feature_indices_

array([0, 2, 5, 9], dtype=int32)
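To show how `n_values_` and `feature_indices_` relate to the encoded columns, here is a pure-Python sketch that rebuilds both and encodes one row by hand. The helper names are hypothetical, and the sketch assumes the legacy `n_values='auto'` behavior shown above, where each column is assumed to take values 0 up to its maximum. Newer scikit-learn releases describe the learned values through `categories_` instead of these two attributes.

```python
from itertools import accumulate

def fit_one_hot(rows):
    """Return (n_values, feature_indices) for integer-coded columns."""
    n_values = [max(col) + 1 for col in zip(*rows)]
    # Column j of the input occupies output columns
    # feature_indices[j] .. feature_indices[j + 1] - 1.
    feature_indices = [0] + list(accumulate(n_values))
    return n_values, feature_indices

def one_hot_row(row, feature_indices):
    """Dense one-hot encoding of a single row of integer codes."""
    encoded = [0.0] * feature_indices[-1]
    for j, value in enumerate(row):
        encoded[feature_indices[j] + value] = 1.0
    return encoded

train = [[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]]
n_values, feature_indices = fit_one_hot(train)
encoded = one_hot_row([0, 1, 1], feature_indices)

print(n_values)
print(feature_indices)
print(encoded)
```

The three input columns expand to 2 + 3 + 4 = 9 output columns, and each encoded row contains exactly one 1 per original feature.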