<h2>Introduction</h2>
<div style="font-size:18px;font-family:Calibri">
    A categorical feature whose categories have no intrinsic ordering is called nominal, e.g., Red, Green, Blue. <br>
    A categorical feature whose categories have a natural ordering is called ordinal, e.g., Low, Medium, High.
</div>
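
<div style="font-size:18px;font-family:Calibri">
    As a minimal sketch of the distinction, pandas can mark a categorical as ordered or unordered. The variable names below are illustrative and are not used elsewhere in this notebook.
</div>

```python
import pandas as pd

# Nominal: the categories carry no order.
colors = pd.Categorical(["Red", "Green", "Blue", "Green"],
                        categories=["Red", "Green", "Blue"],
                        ordered=False)

# Ordinal: the categories are listed from lowest to highest.
levels = pd.Categorical(["Low", "High", "Medium", "Low"],
                        categories=["Low", "Medium", "High"],
                        ordered=True)

print(colors.ordered)         # False
print(levels.ordered)         # True
print(levels.max())           # High (only meaningful because levels is ordered)
print(levels.codes.tolist())  # [0, 2, 1, 0]
```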

In [79]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.preprocessing import LabelBinarizer, MultiLabelBinarizer
from sklearn.feature_extraction import DictVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris

import warnings
warnings.filterwarnings("ignore")

<h2>Encoding the Nominal Categorical Features</h2>
<div style="font-size:18px;font-family:Calibri">
    One-hot encoding using scikit-learn's LabelBinarizer.
</div>

In [13]:
feature = np.array([["Texas"], ["California"], ["Texas"], ["Delaware"], ["Texas"]])
one_hot = LabelBinarizer()
one_hot.fit_transform(feature)

array([[0, 0, 1],
       [1, 0, 0],
       [0, 0, 1],
       [0, 1, 0],
       [0, 0, 1]])

In [15]:
one_hot.classes_

array(['California', 'Delaware', 'Texas'], dtype='<U10')

In [17]:
one_hot.inverse_transform(one_hot.transform(feature))

array(['Texas', 'California', 'Texas', 'Delaware', 'Texas'], dtype='<U10')

In [21]:
pd.get_dummies(feature[:, 0])

Unnamed: 0,California,Delaware,Texas
0,False,False,True
1,True,False,False
2,False,False,True
3,False,True,False
4,False,False,True
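
<div style="font-size:18px;font-family:Calibri">
    scikit-learn also provides OneHotEncoder, which is intended for feature columns (LabelBinarizer is documented for target labels). The sketch below is one possible alternative, not the approach used above; "Nevada" is a made-up unseen value added for illustration.
</div>

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

feature = np.array([["Texas"], ["California"], ["Texas"], ["Delaware"], ["Texas"]])

# handle_unknown="ignore" encodes categories not seen during fit as all zeros
# instead of raising an error.
encoder = OneHotEncoder(handle_unknown="ignore")
encoded = encoder.fit_transform(feature).toarray()  # output is sparse by default

print(encoder.categories_[0])  # learned categories in sorted order
print(encoded)

# Hypothetical category that was not present when the encoder was fitted.
unseen = encoder.transform(np.array([["Nevada"]])).toarray()
print(unseen)
```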


In [23]:
multiclass_features = [("Texas", "Florida"),
                       ("California", "Alabama"),
                       ("Texas", "Florida"),
                       ("Delaware", "Florida"),
                       ("Texas", "Alabama")]
one_hot_multiclass = MultiLabelBinarizer()
one_hot_multiclass.fit_transform(multiclass_features)

array([[0, 0, 0, 1, 1],
       [1, 1, 0, 0, 0],
       [0, 0, 0, 1, 1],
       [0, 0, 1, 1, 0],
       [1, 0, 0, 0, 1]])

In [25]:
one_hot_multiclass.classes_

array(['Alabama', 'California', 'Delaware', 'Florida', 'Texas'],
      dtype=object)
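
<div style="font-size:18px;font-family:Calibri">
    The multi-label encoding can be reversed with inverse_transform. The short sketch below uses its own small input (the names rows and mlb are illustrative); note that the labels in each decoded tuple follow the order of classes_, not the order of the original input.
</div>

```python
from sklearn.preprocessing import MultiLabelBinarizer

rows = [("Texas", "Florida"),
        ("California", "Alabama"),
        ("Delaware", "Florida")]

mlb = MultiLabelBinarizer()
encoded = mlb.fit_transform(rows)

# Each row of the indicator matrix is mapped back to a tuple of labels.
decoded = mlb.inverse_transform(encoded)
print(decoded)
```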

<h2>Encoding the Ordinal Categorical Features</h2>
<div style="font-size:18px;font-family:Calibri">
    Use the pandas replace method to map the string labels to their numerical equivalents.
</div>

In [38]:
df = pd.DataFrame({"Score": ["Low", "High", "Low", "Medium", "Low", "High"]})
scale_mapper = {"Low": 1, "Medium": 2, "High": 3}
# Replace on the column itself so the result is a single Series
df["Score_num"] = df["Score"].replace(scale_mapper)

In [40]:
df.head()

Unnamed: 0,Score,Score_num
0,Low,1
1,High,3
2,Low,1
3,Medium,2
4,Low,1
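
<div style="font-size:18px;font-family:Calibri">
    A possible alternative is scikit-learn's OrdinalEncoder with an explicit category order. This is a sketch rather than the method used above; the column name Score_code is illustrative, and the codes start at 0 instead of 1.
</div>

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({"Score": ["Low", "High", "Low", "Medium", "Low", "High"]})

# Passing the categories explicitly fixes the order Low < Medium < High;
# without it, the encoder would sort the labels alphabetically.
encoder = OrdinalEncoder(categories=[["Low", "Medium", "High"]])
df["Score_code"] = encoder.fit_transform(df[["Score"]]).ravel()

print(df)
```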


<h2>Encoding the Dictionary of Features</h2>

In [45]:
data = [{"Red": 2, "Blue": 4},
       {"Red": 4, "Yellow": 5},
       {"Green": 1, "Blue": 3},
       {"Red": 10, "Green": 6}]

dictvectorizer = DictVectorizer(sparse = False)
dictvectorizer.fit_transform(data)

array([[ 4.,  0.,  2.,  0.],
       [ 0.,  0.,  4.,  5.],
       [ 3.,  1.,  0.,  0.],
       [ 0.,  6., 10.,  0.]])

In [57]:
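# DictVectorizer orders columns by sorted feature name, so take the labels from the fitted vectorizer
pd.DataFrame(dictvectorizer.fit_transform(data), columns = dictvectorizer.get_feature_names_out())

Unnamed: 0,Blue,Green,Red,Yellow
0,4.0,0.0,2.0,0.0
1,0.0,0.0,4.0,5.0
2,3.0,1.0,0.0,0.0
3,0.0,6.0,10.0,0.0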
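
<div style="font-size:18px;font-family:Calibri">
    To see which column belongs to which key, the fitted vectorizer can be inspected directly. The sketch below uses the short name vec for illustration; by default DictVectorizer sorts feature names, which determines the column order.
</div>

```python
from sklearn.feature_extraction import DictVectorizer

data = [{"Red": 2, "Blue": 4},
        {"Red": 4, "Yellow": 5},
        {"Green": 1, "Blue": 3},
        {"Red": 10, "Green": 6}]

vec = DictVectorizer(sparse=False)
matrix = vec.fit_transform(data)

# feature_names_ lists the columns in order; vocabulary_ maps name -> column index.
print(vec.feature_names_)
print(vec.vocabulary_)

# inverse_transform rebuilds dictionaries containing only the non-zero entries.
print(vec.inverse_transform(matrix)[0])
```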


<h2>Imputing the Missing Class Values</h2>

In [66]:
x = np.array([[0, 2.10, 1.45],
             [1, 2, 90],
             [0, 3.4, 9.9],
             [1, 5.6, -89.9]])

x_with_nan = np.array([[np.nan, 0.87, 1.31],
                      [np.nan, 9.9, -2]])

clf = KNeighborsClassifier(3, weights = 'distance')
trained_model = clf.fit(x[:, 1:], x[:, 0])
imputed_values = trained_model.predict(x_with_nan[:, 1:])
imputed_values

array([0., 0.])

In [68]:
x_with_imputed = np.hstack((imputed_values.reshape(-1, 1), x_with_nan[:, 1:]))
np.vstack((x_with_imputed, x))

array([[  0.  ,   0.87,   1.31],
       [  0.  ,   9.9 ,  -2.  ],
       [  0.  ,   2.1 ,   1.45],
       [  1.  ,   2.  ,  90.  ],
       [  0.  ,   3.4 ,   9.9 ],
       [  1.  ,   5.6 , -89.9 ]])

In [76]:
xc = np.vstack((x_with_nan, x))
imputer = SimpleImputer(strategy = 'most_frequent')
imputer.fit_transform(xc)

array([[  0.  ,   0.87,   1.31],
       [  0.  ,   9.9 ,  -2.  ],
       [  0.  ,   2.1 ,   1.45],
       [  1.  ,   2.  ,  90.  ],
       [  0.  ,   3.4 ,   9.9 ],
       [  1.  ,   5.6 , -89.9 ]])
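
<div style="font-size:18px;font-family:Calibri">
    The KNN steps above can be wrapped in a small helper. The function impute_class_with_knn below is hypothetical (it is not part of scikit-learn) and assumes that only the class column contains missing values.
</div>

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier


def impute_class_with_knn(X, class_col=0, n_neighbors=3):
    """Fill NaNs in class_col with KNN predictions from the other columns."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X[:, class_col])
    filled = X.copy()
    if not missing.any():
        return filled  # nothing to impute

    other_cols = [c for c in range(X.shape[1]) if c != class_col]
    knn = KNeighborsClassifier(n_neighbors, weights="distance")
    # Train on the rows where the class is known.
    knn.fit(X[~missing][:, other_cols], X[~missing, class_col])
    # Predict the class for the rows where it is missing.
    filled[missing, class_col] = knn.predict(X[missing][:, other_cols])
    return filled


xc = np.array([[np.nan, 0.87, 1.31],
               [np.nan, 9.9, -2],
               [0, 2.10, 1.45],
               [1, 2, 90],
               [0, 3.4, 9.9],
               [1, 5.6, -89.9]])

filled = impute_class_with_knn(xc)
print(filled[:, 0])
```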

<h2>Handling Imbalanced Classes</h2>

In [97]:
iris = load_iris()
features = iris.data
target = iris.target

In [99]:
features = features[40:,:]
target = target[40:]
target

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

In [101]:
target = np.where((target == 0), 0, 1)
target

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])

In [106]:
# Give the minority class (0) a larger weight than the majority class (1)
weights = {0: 0.9, 1: 0.1}
RandomForestClassifier(class_weight = weights)

In [108]:
RandomForestClassifier(class_weight = "balanced")
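
<div style="font-size:18px;font-family:Calibri">
    With class_weight set to "balanced", scikit-learn weights each class inversely to its frequency: n_samples / (n_classes * count of that class). The sketch below computes those weights for a target with the same class counts as above (10 observations of class 0 and 100 of class 1); the variable names are illustrative.
</div>

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Same class counts as the modified iris target: 10 zeros and 100 ones.
target = np.array([0] * 10 + [1] * 100)
classes = np.unique(target)

balanced = compute_class_weight(class_weight="balanced", classes=classes, y=target)

# The same weights computed by hand from the formula.
manual = len(target) / (len(classes) * np.bincount(target))

print(dict(zip(classes.tolist(), balanced.tolist())))
print(manual)
```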

In [118]:
i_class0 = np.where(target == 0)[0]
i_class1 = np.where(target == 1)[0]

In [120]:
n_class0 = len(i_class0)
n_class1 = len(i_class1)

<div style="font-size:18px;font-family:Calibri">
    In downsampling, we randomly sample without replacement from the majority class (the class with more observations) to create a subset of observations equal in size to the minority class.
</div>

In [122]:
i_class1_downsampled = np.random.choice(i_class1, size = n_class0, replace=False)

In [124]:
np.hstack((target[i_class0], target[i_class1_downsampled]))

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])

In [130]:
np.vstack((features[i_class0, :], features[i_class1_downsampled, :]))[0:5]

array([[5. , 3.5, 1.3, 0.3],
       [4.5, 2.3, 1.3, 0.3],
       [4.4, 3.2, 1.3, 0.2],
       [5. , 3.5, 1.6, 0.6],
       [5.1, 3.8, 1.9, 0.4]])
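
<div style="font-size:18px;font-family:Calibri">
    Because the sample is random, the selected rows change on every run. A minimal sketch of a reproducible version, assuming a seeded NumPy generator is acceptable (the seed 0 and the names X_bal and y_bal are arbitrary choices):
</div>

```python
import numpy as np
from sklearn.datasets import load_iris

# Rebuild the imbalanced data: 10 observations of class 0, 100 of class 1.
iris = load_iris()
features = iris.data[40:, :]
target = np.where(iris.target[40:] == 0, 0, 1)

rng = np.random.default_rng(0)  # fixed seed so the sample is repeatable

i_class0 = np.flatnonzero(target == 0)
i_class1 = np.flatnonzero(target == 1)

# Draw as many majority-class rows as there are minority-class rows, without replacement.
i_class1_downsampled = rng.choice(i_class1, size=len(i_class0), replace=False)

keep = np.concatenate((i_class0, i_class1_downsampled))
X_bal, y_bal = features[keep], target[keep]

print(np.bincount(y_bal))
print(X_bal.shape)
```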

<div style="font-size:18px;font-family:Calibri">
    In upsampling, we randomly sample with replacement from the minority class until it has as many observations as the majority class.
</div>

In [134]:
i_class0_upsampled = np.random.choice(i_class0, size = n_class1, replace=True)
np.concatenate((target[i_class0_upsampled], target[i_class1]))

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1])

In [138]:
np.vstack((features[i_class0_upsampled, :], features[i_class1, :]))[:4]

array([[5. , 3.5, 1.6, 0.6],
       [5. , 3.5, 1.6, 0.6],
       [5.1, 3.8, 1.6, 0.2],
       [5. , 3.3, 1.4, 0.2]])
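
<div style="font-size:18px;font-family:Calibri">
    The same upsampling can be sketched with scikit-learn's resample utility, which samples features and target together. This is an alternative to the NumPy indexing above, not the method used in this notebook; random_state=0 and the variable names are arbitrary choices.
</div>

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.utils import resample

# Rebuild the imbalanced data: 10 observations of class 0, 100 of class 1.
iris = load_iris()
features = iris.data[40:, :]
target = np.where(iris.target[40:] == 0, 0, 1)

X_min, y_min = features[target == 0], target[target == 0]
X_maj, y_maj = features[target == 1], target[target == 1]

# Sample the minority class with replacement up to the majority class size.
X_min_up, y_min_up = resample(X_min, y_min,
                              replace=True,
                              n_samples=len(y_maj),
                              random_state=0)

X_up = np.vstack((X_min_up, X_maj))
y_up = np.concatenate((y_min_up, y_maj))

print(np.bincount(y_up))
print(X_up.shape)
```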