# Sklearn Preprocessing Techniques

- Label Encoder
- Label Binarizer

---

## Import modules

In [1]:
# standard

import pandas as pd

---

## Import Data

In [2]:
!ls


LICENSE                        preprocessing_data.xlsx
README.md                      preprocessing_techniques.ipynb


In [3]:
# Target has only 2 classes and each sample has 1 target only (binary)

df0 = pd.read_excel('preprocessing_data.xlsx', sheet_name=0)
df0

Unnamed: 0,f1,f2,f3,t1
0,as,10,a,cold
1,sp,20,b,hot
2,as,30,c,hot
3,ks,40,d,hot
4,sp,50,e,cold


In [4]:
# Target has more than 2 classes and each sample has 1 target only (multi-class)

df1 = pd.read_excel('preprocessing_data.xlsx', sheet_name=1)
df1

Unnamed: 0,f1,f2,f3,t1
0,as,10,a,blr
1,sp,20,b,nyc
2,as,30,c,lax
3,ks,40,d,nyc
4,sp,50,e,nyc


In [5]:
# Target has more than 2 classes and each sample might have more than 1 class (multi-label)

df2 = pd.read_excel('preprocessing_data.xlsx', sheet_name=2)
df2

Unnamed: 0,f1,f2,f3,t1
0,as,10,a,"['blr', 'che']"
1,sp,20,b,"['nyc', 'dal']"
2,as,30,c,['lax']
3,ks,40,d,['nyc']
4,sp,50,e,"['nyc', 'lax']"
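If `preprocessing_data.xlsx` is not at hand, the three frames can be approximated in memory. This is only a sketch built from the tables printed above; the `*_demo` names are hypothetical, and it assumes the multi-label `t1` values are stored as strings, as they are when read from the sheet.

```python
import pandas as pd

# Feature columns shared by all three sheets (copied from the printed tables)
features = {
    'f1': ['as', 'sp', 'as', 'ks', 'sp'],
    'f2': [10, 20, 30, 40, 50],
    'f3': ['a', 'b', 'c', 'd', 'e'],
}

# Binary target
df0_demo = pd.DataFrame({**features, 't1': ['cold', 'hot', 'hot', 'hot', 'cold']})

# Multi-class target
df1_demo = pd.DataFrame({**features, 't1': ['blr', 'nyc', 'lax', 'nyc', 'nyc']})

# Multi-label target, kept as strings (assumption: mirrors how the sheet is read)
df2_demo = pd.DataFrame({**features, 't1': [
    "['blr', 'che']", "['nyc', 'dal']", "['lax']", "['nyc']", "['nyc', 'lax']",
]})
```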


---

## Label Encoder

- Meant to be used on the target, not on features.
- Works with multi-class targets (1 label per sample, but more than 2 classes).
- "Works" with multi-label targets, but see the note below: that is not normally what you want to use the label encoder for.

In [6]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

le.fit_transform(df1['t1'])

array([0, 2, 1, 2, 2])

In [7]:
print(type(le.fit_transform(df1['t1'])))
print(le.fit_transform(df1['t1']).shape)

# Output is a 1D array whose length equals the number of samples passed to fit_transform

<class 'numpy.ndarray'>
(5,)


In [8]:
# Get a list of unique classes from the LabelEncoder

le.classes_

array(['blr', 'lax', 'nyc'], dtype=object)

In [9]:
# Transform another data set

le.transform(['nyc'])

array([2])

In [10]:
# Raises a ValueError for labels that were not seen during fit

# le.transform(['SomethingElse'])
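One way to deal with unseen labels is to catch the error explicitly. The sketch below is illustrative only: `safe_transform` and `le_demo` are hypothetical names, and returning `None` is just one possible policy.

```python
from sklearn.preprocessing import LabelEncoder

# Fit on the same multi-class labels as df1['t1']
le_demo = LabelEncoder()
le_demo.fit(['blr', 'nyc', 'lax', 'nyc', 'nyc'])

def safe_transform(encoder, labels):
    """Return the encoded labels, or None if any label was not seen during fit."""
    try:
        return encoder.transform(labels)
    except ValueError:
        # LabelEncoder raises ValueError for previously unseen labels
        return None

seen = safe_transform(le_demo, ['nyc'])              # encoded array
unseen = safe_transform(le_demo, ['SomethingElse'])  # None
```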

In [11]:
# Inverse Transform to get back the original class names

le.inverse_transform(le.transform(['nyc']))

array(['nyc'], dtype=object)

In [12]:
# Does it work for multilabels?

le2 = LabelEncoder()

le2.fit_transform(df2['t1'])

array([0, 2, 1, 4, 3])

In [13]:
le2.classes_

array(["['blr', 'che']", "['lax']", "['nyc', 'dal']", "['nyc', 'lax']",
       "['nyc']"], dtype=object)

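NOTE: With multi-label targets, LabelEncoder "works" only in the sense that it does not raise an error when a sample carries several targets. It treats each sample's whole value as a single class (here each cell is the string form of a list, as `le2.classes_` shows), so it produces one code per label set rather than one per label. That is fine if the intention is to label the set as a whole. Usually, though, you would want an encoded value for every individual label.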
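If the goal is one code per individual label, one possible approach is to parse the strings into lists, fit the encoder on the flattened labels, and then transform each sample's list. This is a sketch under the assumption that the targets are stored as strings like those in `df2['t1']`; `raw`, `parsed` and `le_flat` are hypothetical names.

```python
import ast
from sklearn.preprocessing import LabelEncoder

# Same values as df2['t1'] (strings that look like lists)
raw = ["['blr', 'che']", "['nyc', 'dal']", "['lax']", "['nyc']", "['nyc', 'lax']"]

# Turn each string into a real Python list
parsed = [ast.literal_eval(s) for s in raw]

# Fit on every individual label, not on the sets
flat = [label for labels in parsed for label in labels]
le_flat = LabelEncoder().fit(flat)

# Encode each sample's labels separately
encoded = [le_flat.transform(labels).tolist() for labels in parsed]
# encoded -> [[0, 1], [4, 2], [3], [4], [4, 3]]
```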

---

## LabelBinarizer

- The key thing is that it "binarizes" the labels, i.e., converts each label into an array of 0s and 1s whose width depends on the number of classes it was fit on (one column per class, or a single column when there are only 2 classes).
- Works with multi-class targets.
- "Works" with multi-label targets, but see the note below: that is not normally what you want to use the label binarizer for.

In [14]:
# Recall df0

df0[['t1']]

Unnamed: 0,t1
0,cold
1,hot
2,hot
3,hot
4,cold


In [15]:
from sklearn.preprocessing import LabelBinarizer

lb0 = LabelBinarizer()

lb0.fit_transform(df0['t1'])

array([[0],
       [1],
       [1],
       [1],
       [0]])

In [16]:
lb0.classes_

array(['cold', 'hot'], dtype='<U4')

In [17]:
# There are 2 classes (cold and hot), so each binarized target is an array
# with 1 element: [0] for 'cold' or [1] for 'hot'

print(lb0.fit_transform(df0['t1'])[0].shape)

lb0.fit_transform(df0['t1'])[0]

(1,)


array([0])
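Some downstream code expects one column per class even in the binary case. A minimal sketch of how the single column could be expanded, assuming the column order should follow `classes_` (`col` and `two_col` are hypothetical names):

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer

y = ['cold', 'hot', 'hot', 'hot', 'cold']  # same values as df0['t1']

# Binary targets give a single column: 0 for 'cold', 1 for 'hot'
col = LabelBinarizer().fit_transform(y)    # shape (5, 1)

# Stack the complement in front to get one column per class: [cold, hot]
two_col = np.hstack([1 - col, col])        # shape (5, 2)
```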

In [18]:
# Trying with multiclass

# Recall df1
df1[['t1']]

Unnamed: 0,t1
0,blr
1,nyc
2,lax
3,nyc
4,nyc


In [19]:
# Define a new label binarizer

lb1 = LabelBinarizer()

lb1.fit_transform(df1['t1'])

array([[1, 0, 0],
       [0, 0, 1],
       [0, 1, 0],
       [0, 0, 1],
       [0, 0, 1]])

In [20]:
# Unique classes

lb1.classes_

array(['blr', 'lax', 'nyc'], dtype='<U3')

In [21]:
# There are 3 classes ('blr', 'lax', 'nyc'), so each binarized target is an array
# with 3 elements: a 1 in the position of its class and 0s everywhere else

print(lb1.fit_transform(df1['t1'])[0].shape)

lb1.fit_transform(df1['t1'])[0]

(3,)


array([1, 0, 0])

In [22]:
# As seen below, 'lax' becomes one row of 3 elements (0s and 1s) in a 2D array of shape (1, 3)

lb1.transform(['lax'])

array([[0, 1, 0]])

In [23]:
lb1.inverse_transform(lb1.transform(['lax']))

array(['lax'], dtype='<U3')
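Both transformers order their classes the same way (sorted), so for multi-class targets the position of the 1 in each binarized row matches the LabelEncoder code. A quick sketch to check this on the `df1['t1']` values (`codes` and `onehot` are hypothetical names):

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer, LabelEncoder

y = ['blr', 'nyc', 'lax', 'nyc', 'nyc']  # same values as df1['t1']

codes = LabelEncoder().fit_transform(y)     # one integer per sample
onehot = LabelBinarizer().fit_transform(y)  # one row of 0s and 1s per sample

# The column holding the 1 is the encoded integer
matches = np.array_equal(onehot.argmax(axis=1), codes)
```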

In [24]:
# Trying with multi-label

In [25]:
# Recall df2

df2

Unnamed: 0,f1,f2,f3,t1
0,as,10,a,"['blr', 'che']"
1,sp,20,b,"['nyc', 'dal']"
2,as,30,c,['lax']
3,ks,40,d,['nyc']
4,sp,50,e,"['nyc', 'lax']"


In [26]:
# Define a new label binarizer

lb2 = LabelBinarizer()

lb2.fit_transform(df2['t1'])

array([[1, 0, 0, 0, 0],
       [0, 0, 1, 0, 0],
       [0, 1, 0, 0, 0],
       [0, 0, 0, 0, 1],
       [0, 0, 0, 1, 0]])

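NOTE: LabelBinarizer also "works" with multi-label targets in that it does not raise an error, but it treats each sample's whole value as one class (5 distinct values here, hence 5 columns). Normally, this is not the intention with which it is applied.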


In [27]:
# Wrapping 'lax' in a list does not match any fitted class, so every column is 0
lb2.transform([['lax']])

array([[0, 0, 0, 0, 0]])

In [29]:
# The plain string 'lax' is not a fitted class either
lb2.transform(['lax'])

array([[0, 0, 0, 0, 0]])
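The all-zero rows come from the fitted classes being whole strings such as `"['lax']"`, so `'lax'` on its own matches none of them. A small sketch that makes this visible, using the same strings as `df2['t1']` (`lb_demo` is a hypothetical name):

```python
from sklearn.preprocessing import LabelBinarizer

# Same values as df2['t1'] (strings that look like lists)
raw = ["['blr', 'che']", "['nyc', 'dal']", "['lax']", "['nyc']", "['nyc', 'lax']"]

lb_demo = LabelBinarizer().fit(raw)
classes = lb_demo.classes_.tolist()

plain_is_class = 'lax' in classes        # False: 'lax' alone was never seen
string_is_class = "['lax']" in classes   # True: the whole string is the class

# Passing the exact fitted string gives a proper one-hot row
row = lb_demo.transform(["['lax']"])
```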

In [36]:
# Each t1 value is a str (the text of a list), not an actual list
df2.loc[2,'t1']

type(df2.loc[2,'t1'])

str

In [38]:
import ast

# Parse each string into a real Python list
df2['t1_ast'] = df2['t1'].apply(ast.literal_eval)

df2

Unnamed: 0,f1,f2,f3,t1,t1_ast
0,as,10,a,"['blr', 'che']","[blr, che]"
1,sp,20,b,"['nyc', 'dal']","[nyc, dal]"
2,as,30,c,['lax'],[lax]
3,ks,40,d,['nyc'],[nyc]
4,sp,50,e,"['nyc', 'lax']","[nyc, lax]"


In [39]:
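# Each element of the list is treated as a separate sample;
# 'lax' is still not a fitted class, so the row is all 0s
lb2.transform(df2.loc[2,'t1_ast'])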

array([[0, 0, 0, 0, 0]])

In [40]:
# Two elements give two rows; neither 'blr' nor 'che' is a fitted class
lb2.transform(df2.loc[0,'t1_ast'])

array([[0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0]])
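The attempts above show that LabelBinarizer does not give one column per individual label here. A different transformer, scikit-learn's `MultiLabelBinarizer`, is designed for that case. The sketch below is not part of the notebook's original flow; it assumes the targets are first parsed into lists, and `raw`, `parsed`, `mlb` and `Y` are hypothetical names.

```python
import ast
from sklearn.preprocessing import MultiLabelBinarizer

# Same values as df2['t1'] (strings that look like lists)
raw = ["['blr', 'che']", "['nyc', 'dal']", "['lax']", "['nyc']", "['nyc', 'lax']"]
parsed = [ast.literal_eval(s) for s in raw]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(parsed)  # one column per individual label

# mlb.classes_ -> ['blr', 'che', 'dal', 'lax', 'nyc']
# Y[0]         -> [1, 1, 0, 0, 0]  (both 'blr' and 'che' are set)

# inverse_transform returns one tuple of labels per sample
first = mlb.inverse_transform(Y)[0]
```

Each row can now contain more than one 1, which is what a multi-label target needs.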

---