## Categorical Encoding

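Most models expect numeric input, so categorical variables such as text labels have to be encoded as numbers first. The approach used here is the same idea as dummy coding categorical variables in regression: each category gets its own 0/1 indicator column. Unlike typical regression dummy coding, no reference level is dropped, so every category keeps a column.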
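
To make the idea concrete before bringing in any library, here is a minimal hand-rolled sketch of 0/1 indicator encoding for a single column. The variable names (`neighborhoods`, `categories`, `one_hot`) are made up for illustration, and the values are the four neighborhood labels used in this section. It is not how scikit-learn implements it.

```python
# Hand-rolled indicator (one-hot) encoding of one categorical column.
# Illustrative only; names here are hypothetical, not from any library.
neighborhoods = ['Queen Anne', 'Fremont', 'Wallingford', 'Fremont']

# Sort the distinct categories so the column order is deterministic.
categories = sorted(set(neighborhoods))

# One row per record, one 0/1 column per category.
one_hot = [[int(value == category) for category in categories]
           for value in neighborhoods]

print(categories)
for row in one_hot:
    print(row)
```

Each row has exactly one 1, in the column that matches its category.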

In [1]:
data = [
    {'price': 850000, 'rooms': 4, 'neighborhood': 'Queen Anne'},
    {'price': 700000, 'rooms': 3, 'neighborhood': 'Fremont'},
    {'price': 650000, 'rooms': 3, 'neighborhood': 'Wallingford'},
    {'price': 600000, 'rooms': 2, 'neighborhood': 'Fremont'},
]

In [2]:
import pandas as pd
pd.DataFrame(data)

    price  rooms neighborhood
0  850000      4   Queen Anne
1  700000      3      Fremont
2  650000      3  Wallingford
3  600000      2      Fremont


In [3]:
from sklearn.feature_extraction import DictVectorizer

In [4]:
vec = DictVectorizer(sparse=False, dtype=int)

In [7]:
# Each neighborhood becomes its own 0/1 column; price and rooms pass through unchanged.
vec.fit_transform(data)

array([[     0,      1,      0, 850000,      4],
       [     1,      0,      0, 700000,      3],
       [     0,      0,      1, 650000,      3],
       [     1,      0,      0, 600000,      2]])

In [8]:
# get_feature_names_out() returns the column names, in the same order as the array columns.
vec.get_feature_names_out()

array(['neighborhood=Fremont', 'neighborhood=Queen Anne',
       'neighborhood=Wallingford', 'price', 'rooms'], dtype=object)

In [9]:
# Now try sparse=True.
# A sparse matrix stores only the non-zero entries, which saves memory. With many
# categories, the dense one-hot array is mostly zeros and can get huge.

In [10]:
vec = DictVectorizer(sparse=True, dtype=int)

In [11]:
vec.fit_transform(data)

<4x5 sparse matrix of type '<class 'numpy.int64'>'
	with 12 stored elements in Compressed Sparse Row format>
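
The sparse result prints only as a summary, so a quick way to inspect it is to look at its shape and stored-element count and, for small data like this, convert it back to a dense array. A minimal sketch, assuming the same `data` as above; `sparse_encoded` and `dense` are illustrative names.

```python
from sklearn.feature_extraction import DictVectorizer

data = [
    {'price': 850000, 'rooms': 4, 'neighborhood': 'Queen Anne'},
    {'price': 700000, 'rooms': 3, 'neighborhood': 'Fremont'},
    {'price': 650000, 'rooms': 3, 'neighborhood': 'Wallingford'},
    {'price': 600000, 'rooms': 2, 'neighborhood': 'Fremont'},
]

vec = DictVectorizer(sparse=True, dtype=int)
sparse_encoded = vec.fit_transform(data)

# Shape is (rows, columns); nnz is the number of explicitly stored entries.
print(sparse_encoded.shape)
print(sparse_encoded.nnz)

# toarray() materializes the full dense array; only do this when it fits in memory.
dense = sparse_encoded.toarray()
print(dense)
```

The 12 stored elements in the summary above are 4 rows times 3 non-zero entries each (price, rooms, and one neighborhood flag); the remaining 8 cells of the 4x5 matrix are zeros that are never stored.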