# Pre Processing

#### Handling-Categorical-Variable

1. One-Hot-Encoding
2. Label-Encoding
3. Hashing
4. Backward-Difference-Encoding

# __Handling Categorical Variable__

### One Hot Encoding
In this method, each category is mapped to a vector of 1s and 0s denoting the presence or absence of that category. The number of new columns equals the number of categories in the feature.<br><br>This method produces many columns, which can slow down learning significantly when the feature has a very large number of categories.<br><br>One Hot Encoding is very popular. All categories can be represented by **N-1 columns (N = number of categories)**, because the omitted category is implied when all the other columns are 0. Usually, for **regression, N-1** columns are used (drop the first or last column of the one-hot-encoded feature), **but for classification, the recommendation is to use all N columns without dropping any, as most tree-based algorithms build their trees from all available variables.**

In [43]:
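import numpy as np
import pandas as pd

# Note: np.array stores this mix of numbers and strings as strings
my_data = np.array([[5, 'a', 1],
                    [3, 'b', 3],
                    [1, 'b', 2],
                    [3, 'a', 1],
                    [4, 'b', 2],
                    [7, 'c', 1],
                    [7, 'c', 1]])

df = pd.DataFrame(data=my_data, columns=['y', 'dummy', 'x'])
# Create one dummy column per category of 'dummy'.
# dtype=int shows 1/0 (recent pandas versions default to True/False).
df = pd.get_dummies(df, columns=['dummy'], dtype=int)
df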


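|   | y | x | dummy_a | dummy_b | dummy_c |
|---|---|---|---------|---------|---------|
| 0 | 5 | 1 | 1 | 0 | 0 |
| 1 | 3 | 3 | 0 | 1 | 0 |
| 2 | 1 | 2 | 0 | 1 | 0 |
| 3 | 3 | 1 | 1 | 0 | 0 |
| 4 | 4 | 2 | 0 | 1 | 0 |
| 5 | 7 | 1 | 0 | 0 | 1 |
| 6 | 7 | 1 | 0 | 0 | 1 |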



In [44]:
my_data = np.array([[5, 'a', 1],
                    [3, 'b', 3],
                    [1, 'b', 2],
                    [3, 'a', 1],
                    [4, 'b', 2],
                    [7, 'c', 1],
                    [7, 'c', 1]])

df = pd.DataFrame(data=my_data, columns=['y', 'dummy', 'x'])
# To run a regression we need to replace the strings 'a', 'b', 'c' with numbers,
# and we drop one dummy column to avoid the dummy variable trap.
# drop_first=True drops "a", so the coefficients on "b" and "c" show the effect
# of "b" and "c" relative to "a".
df = pd.get_dummies(df, columns=['dummy'], drop_first=True, dtype=int)
df

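|   | y | x | dummy_b | dummy_c |
|---|---|---|---------|---------|
| 0 | 5 | 1 | 0 | 0 |
| 1 | 3 | 3 | 1 | 0 |
| 2 | 1 | 2 | 1 | 0 |
| 3 | 3 | 1 | 0 | 0 |
| 4 | 4 | 2 | 1 | 0 |
| 5 | 7 | 1 | 0 | 1 |
| 6 | 7 | 1 | 0 | 1 |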
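
The dummy variable trap mentioned in the comments can be checked numerically. The sketch below is an illustration only: the helper `with_intercept` is a made-up name, and it assumes a regression design matrix that includes an intercept column. With all N dummy columns the matrix has more columns than its rank, because the dummies sum to the intercept column. Dropping one dummy restores full column rank.

```python
import numpy as np
import pandas as pd

# Same categorical column as in the cells above
df_cat = pd.DataFrame({"dummy": ["a", "b", "b", "a", "b", "c", "c"]})

# All N dummy columns versus N-1 dummy columns
full = pd.get_dummies(df_cat, columns=["dummy"], dtype=float)
reduced = pd.get_dummies(df_cat, columns=["dummy"], drop_first=True, dtype=float)

def with_intercept(frame):
    """Hypothetical helper: prepend a column of ones to the dummy columns."""
    X = frame.to_numpy()
    return np.column_stack([np.ones(len(X)), X])

X_full = with_intercept(full)        # intercept + dummy_a + dummy_b + dummy_c
X_reduced = with_intercept(reduced)  # intercept + dummy_b + dummy_c

rank_full = np.linalg.matrix_rank(X_full)
rank_reduced = np.linalg.matrix_rank(X_reduced)

print("all N dummies:", X_full.shape[1], "columns, rank", rank_full)
print("N-1 dummies:  ", X_reduced.shape[1], "columns, rank", rank_reduced)
```

When the rank is lower than the number of columns, ordinary least squares has no unique solution, which is why one column is usually dropped for regression.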


### Label Encoding
In this encoding, __each category is assigned an integer from 0 through N-1__ (scikit-learn's `LabelEncoder`, used below, numbers the sorted categories this way); here N is the number of categories for the feature. One major issue with this approach is that even when the categories have no inherent order or relationship, an algorithm may treat the integer codes as ordered or as having meaningful distances between them.

In [45]:
my_data = np.array([[5, 'a', 1],
                    [3, 'b', 3],
                    [1, 'b', 2],
                    [3, 'a', 1],
                    [4, 'b', 2],
                    [7, 'c', 1],
                    [7, 'c', 1]])                

df = pd.DataFrame(data=my_data, columns=['y', 'dummy', 'x'])
df

|   | y | dummy | x |
|---|---|-------|---|
| 0 | 5 | a | 1 |
| 1 | 3 | b | 3 |
| 2 | 1 | b | 2 |
| 3 | 3 | a | 1 |
| 4 | 4 | b | 2 |
| 5 | 7 | c | 1 |
| 6 | 7 | c | 1 |


In [46]:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
df['dummy'] = le.fit_transform(df.dummy)
df

|   | y | dummy | x |
|---|---|-------|---|
| 0 | 5 | 0 | 1 |
| 1 | 3 | 1 | 3 |
| 2 | 1 | 1 | 2 |
| 3 | 3 | 0 | 1 |
| 4 | 4 | 1 | 2 |
| 5 | 7 | 2 | 1 |
| 6 | 7 | 2 | 1 |
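
To see which integer each category received, the fitted encoder can be inspected. This is a small sketch on the same labels as above; the variable names `labels`, `mapping` and `decoded` are chosen here for illustration.

```python
from sklearn.preprocessing import LabelEncoder

labels = ["a", "b", "b", "a", "b", "c", "c"]

le = LabelEncoder()
codes = le.fit_transform(labels)

# classes_ holds the sorted categories; a category's position is its code
mapping = {str(category): int(code) for code, category in enumerate(le.classes_)}

# inverse_transform recovers the original labels from the integer codes
decoded = le.inverse_transform(codes)

print(mapping)
print(codes.tolist())
print(list(decoded))
```

The codes follow the sorted order of the labels, not any meaningful ranking, which is the source of the spurious-order issue described above.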



### Hashing
Feature hashing (the "hashing trick") converts a vector of categorical variables to a numeric vector of fixed, pre-chosen size by applying a hash function to each category and using the result as a column index. The distance between two vectors of categorical variables is only approximately maintained in the transformed numerical space.
<br><br>With Hashing, the __number of dimensions will be far less__ than with an encoding like One Hot Encoding (also known as simple binary encoding). This method is **advantageous when the cardinality of the categorical variable is very high**.

Let’s consider the case of a data set with 2 categorical variables, the first one with a cardinality of 70 and the second one with a cardinality of 50. With simple binary encoding, dropping one column per variable, you will have to introduce 118 (70 + 50 – 2) additional fields to replace the 2 categorical variable fields in the data set.

With One Hot Encoding, the distance between categorical variables in any pair of records is preserved in the new space of dimension 118. With Feature Hashing you can get away with a much smaller dimension, e.g. 10 in this case, while recognizing that inter-record distances will not be fully preserved. Hash collisions, where different categories land in the same column, are the reason distances are not preserved, making the mapping less than perfect.
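
The column counts in this example can be reproduced with a toy data set. The sketch below is illustrative only: the column names `v1` and `v2`, the level names, and the `name=value` token format are assumptions made here, not part of the original example.

```python
import pandas as pd
from sklearn.feature_extraction import FeatureHasher

# Toy data: v1 has 70 distinct levels, v2 has 50 (350 rows so every level appears)
n_rows = 350
toy = pd.DataFrame({
    "v1": [f"a{i % 70}" for i in range(n_rows)],
    "v2": [f"b{i % 50}" for i in range(n_rows)],
})

# One-hot encoding with one column dropped per variable: 69 + 49 columns
one_hot = pd.get_dummies(toy, columns=["v1", "v2"], drop_first=True)

# Feature hashing into 10 columns; each row is a list of "name=value" tokens
hasher = FeatureHasher(n_features=10, input_type="string")
tokens = [[f"v1={a}", f"v2={b}"] for a, b in zip(toy["v1"], toy["v2"])]
hashed = hasher.transform(tokens)

print("one-hot columns:", one_hot.shape[1])
print("hashed columns: ", hashed.shape[1])
```

The hashed width is set by `n_features` alone, so it stays at 10 no matter how many levels the variables have.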


In [47]:
from sklearn.feature_extraction import FeatureHasher
h = FeatureHasher(n_features=10)
D = [{'dog': 1, 'cat':2, 'elephant':4},{'dog': 2, 'run': 5}]
f = h.transform(D)
f.toarray()


array([[ 0.,  0., -4., -1.,  0.,  0.,  0.,  0.,  0.,  2.],
       [ 0.,  0.,  0., -2., -5.,  0.,  0.,  0.,  0.,  0.]])
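
The mechanism behind the output above can be sketched by hand. This is a simplified illustration, not scikit-learn's implementation: `FeatureHasher` uses a signed 32-bit MurmurHash3, whereas this sketch uses MD5 from the standard library and ignores signs. The function names `hash_bucket` and `hash_encode` are made up for this example.

```python
import hashlib

def hash_bucket(token, n_features):
    """Map a token to a column index in [0, n_features) via a stable hash."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_features

def hash_encode(tokens, n_features):
    """Count how many tokens fall into each of the n_features columns."""
    vec = [0] * n_features
    for token in tokens:
        vec[hash_bucket(token, n_features)] += 1
    return vec

# 70 distinct levels squeezed into 10 columns: collisions are unavoidable
levels = [f"level_{i}" for i in range(70)]
buckets = [hash_bucket(level, 10) for level in levels]
n_used = len(set(buckets))

print("distinct levels:", len(levels))
print("columns used:   ", n_used)
print(hash_encode(["dog", "cat", "dog"], 10))
```

Because 70 levels share at most 10 columns, several levels necessarily map to the same column, and those levels become indistinguishable after encoding.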


### Backward Difference Encoding
In backward difference coding, the mean of the dependent variable for a level is compared with the mean of the dependent variable for the prior level. This type of coding may be useful for a nominal or an ordinal variable.<br><br>This technique falls under the contrast coding system for categorical features. A feature of K categories, or levels, usually enters a regression as a sequence of K-1 dummy variables.

In [48]:
# !pip install category_encoders
import category_encoders as ce

# df still holds the label-encoded 'dummy' column from the Label Encoding cell
# Specify the columns to encode, then fit and transform
encoder = ce.BackwardDifferenceEncoder(cols=['dummy'])
encoder.fit(df)

# 'dummy' has K = 3 levels, so it is replaced by K-1 = 2 contrast columns
encoder.transform(df, override_return_df=False)

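|   | intercept | y | dummy_0 | dummy_1 | x |
|---|-----------|---|---------|---------|---|
| 0 | 1 | 5 | -0.666667 | -0.333333 | 1 |
| 1 | 1 | 3 | 0.333333 | -0.333333 | 3 |
| 2 | 1 | 1 | 0.333333 | -0.333333 | 2 |
| 3 | 1 | 3 | -0.666667 | -0.333333 | 1 |
| 4 | 1 | 4 | 0.333333 | -0.333333 | 2 |
| 5 | 1 | 7 | 0.333333 | 0.666667 | 1 |
| 6 | 1 | 7 | 0.333333 | 0.666667 | 1 |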
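
The contrast values in this output can be rebuilt by hand, which also shows what the regression coefficients mean. The sketch below is a NumPy-only illustration and not the `category_encoders` implementation; the function name `backward_difference_contrasts` is made up here. It regresses `y` on the intercept and the two contrast columns only (ignoring `x`) and compares the coefficients with the differences between adjacent level means.

```python
import numpy as np

def backward_difference_contrasts(k):
    """Contrast matrix for k levels: one row per level, k-1 columns.

    Column j (1-based) is -(k - j)/k for the first j levels and j/k for the rest.
    """
    C = np.zeros((k, k - 1))
    for j in range(1, k):
        C[:j, j - 1] = -(k - j) / k
        C[j:, j - 1] = j / k
    return C

# Label-encoded 'dummy' column and target from the cells above
levels = np.array([0, 1, 1, 0, 1, 2, 2])
y = np.array([5, 3, 1, 3, 4, 7, 7], dtype=float)

C = backward_difference_contrasts(3)
X = np.column_stack([np.ones(len(y)), C[levels]])  # intercept + 2 contrasts

coef, *_ = np.linalg.lstsq(X, y, rcond=None)
level_means = np.array([y[levels == i].mean() for i in range(3)])

print("contrast matrix:\n", C)
print("coefficients:        ", coef)
print("adjacent mean diffs: ", np.diff(level_means))
```

In this setup the intercept equals the average of the level means, and each remaining coefficient equals the mean of `y` at one level minus the mean at the prior level.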
