In [1]:
import sklearn.preprocessing as skprep
import torch as t
import numpy as np

Most of the preprocessors come in two forms: a function or a class. The class form is usually preferable because it remembers the parameters computed on the training set and can reapply them to the test set.
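
A minimal sketch of the difference, assuming `skprep.scale` as the function form and `StandardScaler` as the class form. The `train` and `test` arrays are made-up illustration data, not part of this notebook.

```python
import numpy as np
import sklearn.preprocessing as skprep

# Hypothetical toy data: three training rows and one test row, two columns each.
train = np.array([[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]])
test = np.array([[3.0, 30.0]])

# Function form: standardizes `train` with its own statistics, then forgets them.
train_scaled_fn = skprep.scale(train)

# Class form: `fit` stores the training statistics, `transform` reuses them.
scaler = skprep.StandardScaler().fit(train)
train_scaled_cls = scaler.transform(train)
test_scaled = scaler.transform(test)

# Both forms agree on the training data.
print(np.allclose(train_scaled_fn, train_scaled_cls))
# The test row equals the training column means, so it maps to zeros.
print(test_scaled)
```

With the function form there is no stored state to apply to `test`, which is the practical reason to reach for the class.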

## Binarization
Binarization converts a tensor of floats into zeros and ones based on a cutoff. Elements greater than the cutoff are replaced with ones, and elements less than or equal to the cutoff are replaced with zeros. The built-in `Binarizer()` adds little here. It is much better to use the more common idiom of broadcast comparison operators, which is easier to read.

In [2]:
x = t.rand((3, 4))
x

tensor([[0.5191, 0.3862, 0.4847, 0.7220],
        [0.2666, 0.9218, 0.2798, 0.9778],
        [0.5845, 0.3524, 0.1679, 0.0645]])

In [3]:
x > 0.5

tensor([[ True, False, False,  True],
        [False,  True, False,  True],
        [ True, False, False, False]])

In [4]:
(x > 0.5).to(t.float32)

tensor([[1., 0., 0., 1.],
        [0., 1., 0., 1.],
        [1., 0., 0., 0.]])

In [5]:
skprep.Binarizer(threshold=0.5).transform(x)

array([[1., 0., 0., 1.],
       [0., 1., 0., 1.],
       [1., 0., 0., 0.]], dtype=float32)
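
The boundary case is worth a quick check. The sketch below uses a made-up NumPy array (`vals`, not from this notebook) that contains the cutoff itself, and compares the broadcast idiom with `Binarizer`.

```python
import numpy as np
import sklearn.preprocessing as skprep

# Hypothetical values: one below, one exactly at, and one above the cutoff.
vals = np.array([[0.2, 0.5, 0.8]])

# Broadcast idiom: strict "greater than" comparison, then cast to float.
broadcast = (vals > 0.5).astype(np.float32)

# Binarizer also uses a strict comparison, so the value at the cutoff maps to 0.
binarized = skprep.Binarizer(threshold=0.5).fit_transform(vals)

print(broadcast)
print(np.array_equal(broadcast, binarized))
```

Both approaches send a value exactly equal to the cutoff to zero.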

## Standardization
Standardization scales each column of the dataset to zero mean and unit variance. It is just one of several scalers that operate on columns: others scale the columns to a fixed range, typically `[0, 1]`, or take outliers into account. See the [sklearn documentation](https://scikit-learn.org/stable/modules/preprocessing.html#standardization-or-mean-removal-and-variance-scaling) for the other scalers. Below is a demo of `StandardScaler`.

Contrast this with normalization, which scales each row of the dataset to unit norm. I haven't used it before, so I am not showing a demo.

First, let's generate a dataset whose columns have different known means and standard deviations (the second argument of `rng.normal` is the standard deviation, not the variance).

In [6]:
rng = np.random.default_rng()
c1 = rng.normal(10, 30, 4).reshape(-1, 1)
c2 = rng.normal(3, 1.5, 4).reshape(-1, 1)
c3 = rng.normal(50, 3, 4).reshape(-1, 1)

In [7]:
x = np.concatenate((c1, c2, c3), axis=1)
x

array([[ 24.16918477,   2.37329517,  50.55234226],
       [ 29.59843264,   4.95234935,  52.725133  ],
       [-10.10438308,   2.73730604,  49.73805577],
       [ -5.58256444,   3.02222418,  56.33356728]])

In [8]:
c1_mean = np.round(np.mean(x[:, 0]), 3)
c1_std = np.round(np.std(x[:, 0]), 3)

c2_mean = np.round(np.mean(x[:, 1]), 3)
c2_std = np.round(np.std(x[:, 1]), 3)

c3_mean = np.round(np.mean(x[:, 2]), 3)
c3_std = np.round(np.std(x[:, 2]), 3)

print(c1_mean, c1_std)
print(c2_mean, c2_std)
print(c3_mean, c3_std)

9.52 17.542
3.271 0.997
52.337 2.553


Let's examine the mean and standard deviation of the first three rows as well, for good measure.

In [9]:
r1_mean = np.round(np.mean(x[0, :]), 3)
r1_std = np.round(np.std(x[0, :]), 3)

r2_mean = np.round(np.mean(x[1, :]), 3)
r2_std = np.round(np.std(x[1, :]), 3)

r3_mean = np.round(np.mean(x[2, :]), 3)
r3_std = np.round(np.std(x[2, :]), 3)

print(r1_mean, r1_std)
print(r2_mean, r2_std)
print(r3_mean, r3_std)

25.698 19.699
29.092 19.506
14.124 25.723


In [10]:
std_scaler = skprep.StandardScaler()

First, compute the parameters of the dataset, i.e., the per-column mean and standard deviation.

In [11]:
std_scaler.fit(x)
print(std_scaler.mean_, std_scaler.scale_)

[ 9.52016747  3.27129369 52.33727458] [17.54241956  0.9974377   2.55258258]


With these parameters, any new row with 3 elements (one per column), typically from the test set, can be transformed using the statistics learned from the fitted dataset.

In [12]:
a = np.array([100., 100., 100.]).reshape(1, -1)
std_scaler.transform(a)

array([[ 5.15777383, 96.97719051, 18.67235395]])
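
The transform is simply `(x - mean_) / scale_` applied column by column. A minimal sketch on made-up data (`train` and `new_row` are illustrative names, not from this notebook):

```python
import numpy as np
import sklearn.preprocessing as skprep

# Hypothetical training set with two columns on very different scales.
train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
scaler = skprep.StandardScaler().fit(train)

new_row = np.array([[4.0, 400.0]])

# z = (x - mean) / scale, using the statistics stored by `fit`.
manual = (new_row - scaler.mean_) / scaler.scale_

# `scale_` is the population standard deviation (ddof=0), like np.std's default.
print(np.allclose(scaler.scale_, np.std(train, axis=0)))
print(np.allclose(manual, scaler.transform(new_row)))
```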

Now transform the original dataset.

In [13]:
scaled_x = std_scaler.transform(x)
scaled_x

array([[ 0.83506253, -0.90030536, -0.69926526],
       [ 1.14455507,  1.68537409,  0.15194745],
       [-1.11869121, -0.5353594 , -1.01827021],
       [-0.86092639, -0.24970933,  1.56558802]])

Let's verify that the columns have zero mean and unit variance. The mean and variance of the rows will have changed too, but they will not necessarily be zero and one.

In [14]:
c1_mean = np.round(np.mean(scaled_x[:, 0]), 3)
c1_std = np.round(np.std(scaled_x[:, 0]), 3)

c2_mean = np.round(np.mean(scaled_x[:, 1]), 3)
c2_std = np.round(np.std(scaled_x[:, 1]), 3)

c3_mean = np.round(np.mean(scaled_x[:, 2]), 3)
c3_std = np.round(np.std(scaled_x[:, 2]), 3)

print(c1_mean, c1_std)
print(c2_mean, c2_std)
print(c3_mean, c3_std)

0.0 1.0
0.0 1.0
-0.0 1.0


In [15]:
r1_mean = np.round(np.mean(scaled_x[0, :]), 3)
r1_std = np.round(np.std(scaled_x[0, :]), 3)

r2_mean = np.round(np.mean(scaled_x[1, :]), 3)
r2_std = np.round(np.std(scaled_x[1, :]), 3)

r3_mean = np.round(np.mean(scaled_x[2, :]), 3)
r3_std = np.round(np.std(scaled_x[2, :]), 3)

print(r1_mean, r1_std)
print(r2_mean, r2_std)
print(r3_mean, r3_std)

-0.255 0.775
0.994 0.635
-0.891 0.255
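
The range-based and outlier-aware scalers mentioned at the start of this section follow the same fit/transform pattern. A rough sketch, assuming `MinMaxScaler` and `RobustScaler` with their default settings and a made-up column (`data`) that contains one outlier:

```python
import numpy as np
import sklearn.preprocessing as skprep

# Hypothetical single column with an obvious outlier in the last row.
data = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# MinMaxScaler maps the column minimum to 0 and the maximum to 1.
minmax = skprep.MinMaxScaler().fit_transform(data)

# RobustScaler centers on the median and scales by the interquartile range,
# so the outlier does not compress the rest of the values.
robust = skprep.RobustScaler().fit_transform(data)

print(minmax.ravel())
print(robust.ravel())
```

With `MinMaxScaler` the four ordinary values end up squeezed near zero, while `RobustScaler` keeps them spread out around zero.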


## Label Encoding
Given an array, usually the **target** column, with `k` distinct categorical values, this preprocessor transforms it into a numeric array with integer values in the range `[0, k-1]`.

In [16]:
cities = ["paris", "paris", "tokyo", "amsterdam"]

In [17]:
le = skprep.LabelEncoder()

In [18]:
le.fit(cities)
le.classes_

array(['amsterdam', 'paris', 'tokyo'], dtype='<U9')

As can be seen, amsterdam will be encoded with 0, paris with 1, and tokyo with 2.

In [19]:
le.transform(cities)

array([1, 1, 2, 0])

`LabelEncoder` assigns the codes in sorted order of the classes. I cannot specify the encoding by hand, i.e., if I want tokyo to be 0, there is no way that I know of to do that with this class.
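
One possible workaround, sketched here without `LabelEncoder` at all, is a plain dictionary lookup. The `order` list and the `code_of` mapping are hypothetical names introduced only for this illustration.

```python
import numpy as np

cities = ["paris", "paris", "tokyo", "amsterdam"]

# Hand-picked order: the position in this list becomes the integer code.
order = ["tokyo", "paris", "amsterdam"]
code_of = {city: code for code, city in enumerate(order)}

# Encode by dictionary lookup; an unseen city would raise a KeyError.
encoded = np.array([code_of[city] for city in cities])

# Decode by indexing back into the same list.
decoded = [order[code] for code in encoded]

print(encoded)  # [1 1 0 2]
print(decoded)
```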

## Dataset Encoding
If, instead of an array, I have a 2D dataset where each column is a different categorical feature, then I cannot use a `LabelEncoder`. I have to use an `OrdinalEncoder` instead. It does the same thing column by column, i.e., it replaces the values of each column with integers in the range `[0, k-1]`, where `k` is the number of categories in that column.

In [20]:
x = [['male', 'from US', 'uses Safari'], 
     ['female', 'from Europe', 'uses Firefox']]

In [21]:
enc = skprep.OrdinalEncoder()

In [22]:
enc.fit(x)
enc.categories_

[array(['female', 'male'], dtype=object),
 array(['from Europe', 'from US'], dtype=object),
 array(['uses Firefox', 'uses Safari'], dtype=object)]

In [23]:
enc.transform(x)

array([[1., 1., 1.],
       [0., 0., 0.]])

This encoder does not work on a one-dimensional array. It expects a 2D input of shape `(n_samples, n_features)`, and recent scikit-learn versions raise an error on 1D input rather than guessing the layout. We have to convert the array into a column vector first for this to work. This is shown below.

In [24]:
a = np.array(cities).reshape(-1, 1)
enc = skprep.OrdinalEncoder()
enc.fit(a)
enc.categories_

[array(['amsterdam', 'paris', 'tokyo'], dtype='<U9')]

In [25]:
enc.transform(a)

array([[1.],
       [1.],
       [2.],
       [0.]])
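
A small check of the 1D behavior, assuming a recent scikit-learn release in which input validation rejects 1D arrays with a `ValueError`. It also confirms that the reshaped column yields the same codes as `LabelEncoder` on the flat array.

```python
import numpy as np
import sklearn.preprocessing as skprep

cities = ["paris", "paris", "tokyo", "amsterdam"]
flat = np.array(cities)

# Fitting directly on the 1D array is expected to fail input validation.
try:
    skprep.OrdinalEncoder().fit(flat)
    raised = False
except ValueError:
    raised = True
print(raised)

# Reshaped into a column, the codes match what LabelEncoder gives for the 1D array.
column_codes = skprep.OrdinalEncoder().fit_transform(flat.reshape(-1, 1))
label_codes = skprep.LabelEncoder().fit_transform(flat)
print(np.array_equal(column_codes.ravel(), label_codes))
```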

This extra reshape is why `LabelEncoder` is more convenient for a single array of labels. However, unlike `LabelEncoder`, `OrdinalEncoder` makes it possible to specify the encoding manually through the `categories` argument.

In [26]:
enc = skprep.OrdinalEncoder(categories=[["male", "female"], ["from US", "from Europe"], ["uses Safari", "uses Firefox"]])
enc.fit(x)
enc.categories_

[array(['male', 'female'], dtype=object),
 array(['from US', 'from Europe'], dtype=object),
 array(['uses Safari', 'uses Firefox'], dtype=object)]

In [27]:
enc.transform(x)

array([[0., 0., 0.],
       [1., 1., 1.]])

## Label One Hot Encoding
Given a categorical array, the `LabelBinarizer` transforms it into a 2D matrix in which each row is the one-hot encoded vector of the corresponding element of the input array.

In [28]:
enc = skprep.LabelBinarizer()
enc.fit(cities)
enc.classes_

array(['amsterdam', 'paris', 'tokyo'], dtype='<U9')

In [29]:
enc.transform(cities)

array([[0, 1, 0],
       [0, 1, 0],
       [0, 0, 1],
       [1, 0, 0]])
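
One caveat worth knowing: with only two classes, `LabelBinarizer` returns a single 0/1 column instead of two one-hot columns. A short sketch with a made-up binary label array (`answers`):

```python
import numpy as np
import sklearn.preprocessing as skprep

# Hypothetical binary labels.
answers = ["yes", "no", "yes"]

lb = skprep.LabelBinarizer()
onehot = lb.fit_transform(answers)

print(lb.classes_)    # classes are sorted, so 'no' comes before 'yes'
print(onehot.shape)   # (3, 1): one column, not two

# inverse_transform maps the encoded column back to the original labels.
restored = lb.inverse_transform(onehot)
print(restored)
```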

## Dataset One Hot Encoding
Similar to the categorical dataset above, we start with a 2D dataset with categorical columns. `OneHotEncoder` expands each column with `k` categories into `k` one-hot columns and concatenates the results. By default it returns a SciPy sparse matrix, which is why the output needs to be converted with `toarray()`.

In [36]:
x = [['male', 'from US', 'uses Safari'], 
     ['female', 'from Europe', 'uses Firefox']]

In [37]:
enc = skprep.OneHotEncoder(handle_unknown="ignore")
enc.fit(x)
enc.categories_

[array(['female', 'male'], dtype=object),
 array(['from Europe', 'from US'], dtype=object),
 array(['uses Firefox', 'uses Safari'], dtype=object)]

In [38]:
enc.transform(x).toarray()

array([[0., 1., 0., 1., 0., 1.],
       [1., 0., 1., 0., 1., 0.]])
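
A brief sketch that inspects the return type and the output width, and maps the encoded rows back with `inverse_transform`. It assumes the encoder's default sparse output.

```python
import numpy as np
import scipy.sparse as sp
import sklearn.preprocessing as skprep

x = [["male", "from US", "uses Safari"],
     ["female", "from Europe", "uses Firefox"]]

enc = skprep.OneHotEncoder(handle_unknown="ignore").fit(x)
encoded = enc.transform(x)

# The default output is a SciPy sparse matrix.
print(sp.issparse(encoded))

# The number of output columns is the total number of categories over all features.
dense = encoded.toarray()
n_categories = sum(len(cats) for cats in enc.categories_)
print(dense.shape, n_categories)

# inverse_transform recovers the original categorical rows.
restored = enc.inverse_transform(dense)
print(restored)
```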

The `handle_unknown="ignore"` argument is there so that if a row contains a category value that was not in the training set (i.e., the dataset that the encoder was fitted to), the one-hot vector for that feature will be all zeros.

In [39]:
a = np.array(["female", "from US", "uses Chrome"]).reshape(1, -1)
enc.transform(a).toarray()

array([[1., 0., 0., 1., 0., 0.]])

As can be seen in the output array, the last two elements, which are the one-hot encoded vector for browser use, are both zeros.
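
For contrast, a sketch of what happens with the default `handle_unknown="error"`: the same unseen browser value makes `transform` raise a `ValueError`.

```python
import numpy as np
import sklearn.preprocessing as skprep

x = [["male", "from US", "uses Safari"],
     ["female", "from Europe", "uses Firefox"]]

# Default setting: unknown categories are an error at transform time.
strict = skprep.OneHotEncoder().fit(x)

unseen = np.array([["female", "from US", "uses Chrome"]])
try:
    strict.transform(unseen)
    raised = False
except ValueError as err:
    raised = True
    print(err)

print(raised)
```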

As before, this will not work on a one-dimensional array. It only works with a column vector.

In [34]:
a = np.array(cities).reshape(-1, 1)
enc = skprep.OneHotEncoder()
enc.fit(a)
enc.categories_

[array(['amsterdam', 'paris', 'tokyo'], dtype='<U9')]

In [35]:
enc.transform(a).toarray()

array([[0., 1., 0.],
       [0., 1., 0.],
       [0., 0., 1.],
       [1., 0., 0.]])
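
For a single array with more than two classes, this gives the same matrix as `LabelBinarizer` on the flat array. A quick sketch of that comparison:

```python
import numpy as np
import sklearn.preprocessing as skprep

cities = ["paris", "paris", "tokyo", "amsterdam"]

# OneHotEncoder needs a column vector and returns a sparse matrix of floats.
column = np.array(cities).reshape(-1, 1)
from_column = skprep.OneHotEncoder().fit_transform(column).toarray()

# LabelBinarizer takes the 1D array directly and returns a dense integer array.
from_labels = skprep.LabelBinarizer().fit_transform(cities)

# The values match even though the dtypes differ.
print(np.array_equal(from_column, from_labels))
```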