# Categorical Features to Numerical - one-hot encoding and ordinal encoding
* We'll show a different method to transform categorical features into one-hot
encoded values.
* We'll show how to transform categorical features into ordinal numerical
 features using pandas.

## DictVectorizer from sklearn
* Used for categorical values stored as dictionaries. It transforms them into
one-hot encoded vectors.
* It does not transform them into ordinal numbers (values with an inherent order).

In [7]:
from sklearn.feature_extraction import DictVectorizer
# We start by defining a list of dictionaries (interest and occupation) that
# we need to transform into one-hot encoded vectors
X_dict = [{'interest': 'tech', 'occupation': 'professional'},
          {'interest': 'fashion', 'occupation': 'student'},
          {'interest': 'fashion', 'occupation': 'professional'},
          {'interest': 'sports', 'occupation': 'student'},
          {'interest': 'tech', 'occupation': 'student'},
          {'interest': 'tech', 'occupation': 'retired'},
          {'interest': 'sports', 'occupation': 'professional'}]

# We create a DictVectorizer. Setting sparse=False returns a dense array
# instead of a sparse matrix (we want to see all of the values).
dict_one_hot_encoder = DictVectorizer(sparse=False)

# Encoding our list of dictionaries
X_encoded = dict_one_hot_encoder.fit_transform(X_dict)

print(X_encoded)

[[0. 0. 1. 1. 0. 0.]
 [1. 0. 0. 0. 0. 1.]
 [1. 0. 0. 1. 0. 0.]
 [0. 1. 0. 0. 0. 1.]
 [0. 0. 1. 0. 0. 1.]
 [0. 0. 1. 0. 1. 0.]
 [0. 1. 0. 1. 0. 0.]]
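To illustrate what `sparse=False` changes, here is a minimal sketch using a two-row sample in the same shape as `X_dict`. The names `samples`, `sparse_encoder`, `X_sparse` and `X_dense` are made up for this example. With the default `sparse=True`, `fit_transform` returns a SciPy sparse matrix, which can be converted to a dense array when you want to inspect it.

```python
from scipy import sparse
from sklearn.feature_extraction import DictVectorizer

# Hypothetical two-row sample, shaped like X_dict
samples = [{'interest': 'tech', 'occupation': 'professional'},
           {'interest': 'fashion', 'occupation': 'student'}]

# With the default sparse=True, the result is a SciPy sparse matrix
sparse_encoder = DictVectorizer()
X_sparse = sparse_encoder.fit_transform(samples)
print(sparse.issparse(X_sparse))  # True

# Convert to a dense NumPy array only when the values need to be inspected
X_dense = X_sparse.toarray()
print(X_dense)
```

The sparse form is usually preferable for large vocabularies, since most entries of a one-hot matrix are zero.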


In [8]:
# We can also inspect the mapping from each feature to its column.
# It shows the column index where each category's 1 will appear
print(dict_one_hot_encoder.vocabulary_)

{'interest=tech': 2, 'occupation=professional': 3, 'interest=fashion': 0, 'occupation=student': 5, 'interest=sports': 1, 'occupation=retired': 4}


In [9]:
# Transforming new data:
new_dict = [{'interest': 'sports', 'occupation': 'retired'}]
new_encoded = dict_one_hot_encoder.transform(new_dict)
print(new_encoded)


[[0. 1. 0. 0. 1. 0.]]


According to the `vocabulary_` mapping, this new sample is interested in
sports (index 1) and is retired (index 4).


In [10]:
# If we want to explicitly transform the encoded features back to the
# original ones, we use inverse_transform
print(dict_one_hot_encoder.inverse_transform(new_encoded))

[{'interest=sports': 1.0, 'occupation=retired': 1.0}]


In [11]:
# Showing what happens if we find categories that we did not encounter
# during training:
new_dict = [{'interest': 'unknown_interest',  # Wasn't in the training set
             'occupation': 'retired'},
            {'interest': 'tech',
             'occupation': 'unseen_occupation'}]  # Wasn't in the training set

new_encoded = dict_one_hot_encoder.transform(new_dict)
print(new_encoded)
print(dict_one_hot_encoder.inverse_transform(new_encoded))

[[0. 0. 0. 0. 1. 0.]
 [0. 0. 1. 0. 0. 0.]]
[{'occupation=retired': 1.0}, {'interest=tech': 1.0}]


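Unlike OneHotEncoder, which raises an error on unknown categories by default,
DictVectorizer silently ignores categories it did not see during fitting: the
unseen value simply produces no 1 in the encoded row.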
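For comparison, here is a minimal sketch of how `OneHotEncoder` treats an unseen category. The data and the names `strict_encoder` and `lenient_encoder` are made up for this example. With its default `handle_unknown='error'` the encoder raises a `ValueError`, and with `handle_unknown='ignore'` it produces all zeros for the unknown feature, much like DictVectorizer.

```python
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training data: one row per sample, columns = interest, occupation
X_train = [['tech', 'professional'],
           ['fashion', 'student'],
           ['sports', 'retired']]
X_new = [['unknown_interest', 'retired']]  # interest was never seen in training

# Default behaviour (handle_unknown='error'): transform raises ValueError
strict_encoder = OneHotEncoder()
strict_encoder.fit(X_train)
try:
    strict_encoder.transform(X_new)
    raised = False
except ValueError:
    raised = True
print(raised)  # True

# handle_unknown='ignore': the unknown interest is encoded as all zeros
lenient_encoder = OneHotEncoder(handle_unknown='ignore')
lenient_encoder.fit(X_train)
encoded = lenient_encoder.transform(X_new).toarray()
print(encoded)  # [[0. 0. 0. 0. 1. 0.]]
```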

## Mapping Ordinal Categorical Values to Numeric Ordinals
* We use pandas for this

In [12]:
import pandas as pd
# Define a DataFrame whose scores are stored as strings
df = pd.DataFrame({'score': ['low',
                             'high',
                             'medium',
                             'medium',
                             'low']})
print(df)

    score
0     low
1    high
2  medium
3  medium
4     low


In [14]:
# We define a dictionary that maps each category to its rank
mapping = {"low": 1, "medium": 2, "high": 3}

# We use the replace method to swap the categories for their numerical values
df['score'] = df['score'].replace(mapping)
print(df)

   score
0      1
1      3
2      2
3      2
4      1
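As a hedged alternative to `replace`, the same ordinal encoding can be sketched with `Series.map` or with an ordered `pd.Categorical`. The names `scores`, `mapped`, `ordered`, `codes` and `with_unknown` are chosen for this example. One difference to be aware of is that `map` turns any label missing from the dictionary into `NaN`, whereas `replace` leaves it unchanged.

```python
import pandas as pd

scores = pd.Series(['low', 'high', 'medium', 'medium', 'low'])
mapping = {'low': 1, 'medium': 2, 'high': 3}

# Option 1: map each label through the dictionary
mapped = scores.map(mapping)
print(mapped.tolist())  # [1, 3, 2, 2, 1]

# Option 2: declare the order explicitly; codes start at 0, so add 1
ordered = pd.Categorical(scores, categories=['low', 'medium', 'high'],
                         ordered=True)
codes = ordered.codes + 1
print(codes.tolist())  # [1, 3, 2, 2, 1]

# Labels that are missing from the mapping become NaN with map
with_unknown = pd.Series(['low', 'very high']).map(mapping)
print(with_unknown.isna().tolist())  # [False, True]
```

The ordered categorical keeps the original labels and their order together, which can be convenient when the data later needs to be sorted or compared.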
