# Chapter 5
# Categorical Variables: Counting Eggs in the Age of Robotic Chickens

- Used to represent categories or labels.
- The values may be represented numerically, but they cannot be ordered with respect to one another. They are nonordinal.


## Encoding Categorical Variables

### One-Hot Encoding

- Implemented in scikit-learn as sklearn.preprocessing.OneHotEncoder.
- A variable with k categories is encoded as k bits, and exactly one bit is on, so the sum of all bits must be equal to 1.
- This constraint makes the features linearly dependent. Linearly dependent features are slightly annoying because they mean that the trained linear models will not be unique. Different linear combinations of the features can make the same predictions, so we would need to jump through extra hoops to understand the effect of a feature on the prediction.
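
The sum-to-one constraint and the resulting linear dependence can be checked directly. The sketch below is illustrative only: the city values are a made-up toy column, and the intercept column is added by hand to mimic what a linear model with a bias term sees.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy categorical column (hypothetical values), shaped (n_samples, 1)
cities = np.array(['SF', 'NYC', 'Seattle', 'NYC', 'SF']).reshape(-1, 1)

encoder = OneHotEncoder()
# fit_transform returns a sparse matrix by default; densify it for inspection
one_hot = encoder.fit_transform(cities).toarray()

print(encoder.categories_[0])  # column order of the encoded features
print(one_hot)

# Exactly one bit is on in every row, so each row sums to 1
row_sums = one_hot.sum(axis=1)

# With an intercept column of ones, the design matrix has 4 columns,
# but the ones column equals the sum of the 3 one-hot columns
design = np.hstack([np.ones((len(cities), 1)), one_hot])
rank = np.linalg.matrix_rank(design)
print(design.shape[1], rank)  # 4 columns but rank 3
```

Because the rank is one less than the number of columns, more than one set of coefficients fits the data equally well.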

### Dummy Encoding

- Dummy encoding removes the extra degree of freedom by using only k-1 features in the representation. The dropped category becomes the reference category and is represented by the vector of all zeros.

In [0]:
import pandas as pd
from sklearn.linear_model import LinearRegression

In [0]:
# Define a toy dataset of apartment rental prices in
# New York, San Francisco, and Seattle
df = pd.DataFrame({'City': ['SF', 'SF', 'SF', 'NYC', 'NYC', 'NYC',
                            'Seattle', 'Seattle', 'Seattle'],
                   'Rent': [3999, 4000, 4001, 3499, 3500, 3501, 2499, 2500, 2501]})

In [0]:
df['Rent'].mean()

3333.3333333333335

In [0]:
one_hot_df = pd.get_dummies(df, prefix=['city'])
one_hot_df

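|   | Rent | city_NYC | city_SF | city_Seattle |
|---|------|----------|---------|--------------|
| 0 | 3999 | 0 | 1 | 0 |
| 1 | 4000 | 0 | 1 | 0 |
| 2 | 4001 | 0 | 1 | 0 |
| 3 | 3499 | 1 | 0 | 0 |
| 4 | 3500 | 1 | 0 | 0 |
| 5 | 3501 | 1 | 0 | 0 |
| 6 | 2499 | 0 | 0 | 1 |
| 7 | 2500 | 0 | 0 | 1 |
| 8 | 2501 | 0 | 0 | 1 |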


In [0]:
model = LinearRegression().fit(one_hot_df[['city_NYC', 'city_SF', 'city_Seattle']],
                                one_hot_df['Rent'])

In [0]:
model.coef_

array([ 166.66666667,  666.66666667, -833.33333333])

In [0]:
model.intercept_

3333.3333333333335

In [0]:
dummy_df = pd.get_dummies(df, prefix=['city'], drop_first=True)
dummy_df

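|   | Rent | city_SF | city_Seattle |
|---|------|---------|--------------|
| 0 | 3999 | 1 | 0 |
| 1 | 4000 | 1 | 0 |
| 2 | 4001 | 1 | 0 |
| 3 | 3499 | 0 | 0 |
| 4 | 3500 | 0 | 0 |
| 5 | 3501 | 0 | 0 |
| 6 | 2499 | 0 | 1 |
| 7 | 2500 | 0 | 1 |
| 8 | 2501 | 0 | 1 |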


In [0]:
model.fit(dummy_df[['city_SF', 'city_Seattle']], dummy_df['Rent'])
model.coef_

array([  500., -1000.])

In [0]:
model.intercept_

3500.0
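
One way to read the two sets of coefficients above is to compare them with the per-city mean rents. The following sketch rebuilds the same toy dataset and computes the offsets by hand; the variable names are illustrative, not part of the original notebook.

```python
import pandas as pd

df = pd.DataFrame({'City': ['SF', 'SF', 'SF', 'NYC', 'NYC', 'NYC',
                            'Seattle', 'Seattle', 'Seattle'],
                   'Rent': [3999, 4000, 4001, 3499, 3500, 3501, 2499, 2500, 2501]})

city_means = df.groupby('City')['Rent'].mean()
overall_mean = df['Rent'].mean()

# One-hot fit above: intercept is the overall mean, and each coefficient
# is that city's mean rent minus the overall mean
one_hot_offsets = city_means - overall_mean

# Dummy fit above: intercept is the mean of the reference category (NYC),
# and each coefficient is that city's mean rent minus the NYC mean
dummy_offsets = city_means - city_means['NYC']

print(one_hot_offsets.round(2))
print(dummy_offsets)
```

The one-hot offsets reproduce the coefficients 166.67, 666.67, and -833.33, and the dummy offsets reproduce 500 and -1000 relative to the NYC intercept of 3500.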

### Effect Coding

- Effect coding is very similar to dummy coding, with the difference that the reference category is now represented by the vector of all -1's.
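
A minimal sketch of effect coding on the same toy rental dataset: start from dummy coding and overwrite the reference-category rows with -1. The `dtype=float` argument and the boolean mask on `City` are choices made here for the illustration, and NYC is the reference category only because it sorts first.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({'City': ['SF', 'SF', 'SF', 'NYC', 'NYC', 'NYC',
                            'Seattle', 'Seattle', 'Seattle'],
                   'Rent': [3999, 4000, 4001, 3499, 3500, 3501, 2499, 2500, 2501]})

# Dummy coding first: NYC is dropped and becomes the reference category.
# Float columns so that -1 can be stored alongside 0 and 1.
effect_df = pd.get_dummies(df, columns=['City'], prefix='city',
                           drop_first=True, dtype=float)

# Effect coding: the reference category becomes the all -1 vector
effect_df.loc[df['City'] == 'NYC', ['city_SF', 'city_Seattle']] = -1.0

model = LinearRegression().fit(effect_df[['city_SF', 'city_Seattle']],
                               effect_df['Rent'])
print(model.coef_)
print(model.intercept_)
```

With this coding the intercept comes out as the overall mean rent, and each coefficient is the corresponding city's offset from that mean. The NYC offset is not stored as a feature; it is the negative of the sum of the other coefficients.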

### Pros and Cons of Categorical Variable Encodings

- One-hot encoding is redundant, but the advantage is that each feature clearly corresponds to a category.
- With one-hot encoding, missing data can be encoded as the all-zeros vector.
- Dummy coding and effect coding are not redundant.
- Effect coding handles missing values better than dummy coding, where the all-zeros vector is already taken by the reference category.
- Effect coding is expensive to store and compute, because the all -1's vector for the reference category is dense.

---

## Dealing with Large Categorical Variables

Solutions:
- Do nothing fancy with the encoding. Use a simple model that is cheap to train.
- Feature hashing, popular with linear models
- Bin counting, popular with linear models as well as trees

### Feature Hashing

A hash function is a deterministic function that maps a potentially unbounded integer to a finite range [1, m]. Since the input domain is potentially larger than the output range, multiple inputs may get mapped to the same output. This is called a collision. A uniform hash function ensures that roughly the same number of inputs are mapped into each of the m bins.
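
The idea can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not a production implementation: `stable_hash` and `hash_features` are hypothetical helper names, MD5 is used only because it is deterministic across runs (Python's built-in `hash` for strings is randomized per process), and bins are indexed from 0 to m-1 rather than 1 to m.

```python
import hashlib

def stable_hash(token: str) -> int:
    """Deterministic non-negative integer hash of a string (MD5-based)."""
    digest = hashlib.md5(token.encode("utf-8")).hexdigest()
    return int(digest, 16)

def hash_features(tokens, m):
    """Map a list of category values to a length-m vector of bin counts."""
    output = [0] * m
    for token in tokens:
        index = stable_hash(token) % m  # bin index in [0, m-1]
        output[index] += 1              # colliding categories share a bin
    return output

# Six observations of a categorical variable squeezed into m = 4 bins
cities = ['SF', 'NYC', 'Seattle', 'SF', 'Boston', 'Austin']
vector = hash_features(cities, m=4)
print(vector)
```

The output length is fixed at m no matter how many distinct categories appear, which is what keeps the feature vector small for large categorical variables. The cost is that categories that collide can no longer be told apart.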

### Bin Counting