# Categorical Features in Regression Models

So far, we have fit linear and $k$-nearest neighbors regression models to data where all of the features are quantitative. But what if some or all of the features are categorical? In theory, the solution is simple: we transform the categorical variables into quantitative variables using dummy (i.e., one-hot) encoding, following the process in Chapter 3. In practice, however, some care is needed to ensure that the categorical variables are transformed consistently between the training data and the test data.
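To see why consistency matters, here is a minimal sketch on a made-up toy data set (the column `color`, its values, and the name `enc_toy` are illustrative assumptions, not part of the Ames data). Encoding the training and test frames separately can produce different columns, whereas an encoder fitted once on the training data always produces the same columns.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: the test set lacks "green" and contains an unseen "blue".
train = pd.DataFrame({"color": ["red", "green", "red"]})
test = pd.DataFrame({"color": ["red", "blue"]})

# Encoding each frame separately gives mismatched dummy columns.
train_cols = list(pd.get_dummies(train).columns)
test_cols = list(pd.get_dummies(test).columns)
print(train_cols, test_cols)

# Fitting the encoder on the training data fixes the set of columns once.
# handle_unknown="ignore" encodes unseen categories as all zeros
# instead of raising an error (which is the default behavior).
enc_toy = OneHotEncoder(handle_unknown="ignore")
enc_toy.fit(train)
test_encoded = enc_toy.transform(test).toarray()
print(test_encoded)
```

Whether to ignore unseen categories or raise an error is a judgment call that depends on the application; the sketch only shows that the choice exists.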

In [1]:
import pandas as pd
df_housing = pd.read_csv("AmesHousing.txt", sep="\t")
df_housing.head()

,Order,PID,MS SubClass,MS Zoning,Lot Frontage,Lot Area,Street,Alley,Lot Shape,Land Contour,...,Pool Area,Pool QC,Fence,Misc Feature,Misc Val,Mo Sold,Yr Sold,Sale Type,Sale Condition,SalePrice
0,1,526301100,20,RL,141.0,31770,Pave,,IR1,Lvl,...,0,,,,0,5,2010,WD,Normal,215000
1,2,526350040,20,RH,80.0,11622,Pave,,Reg,Lvl,...,0,,MnPrv,,0,6,2010,WD,Normal,105000
2,3,526351010,20,RL,81.0,14267,Pave,,IR1,Lvl,...,0,,,Gar2,12500,6,2010,WD,Normal,172000
3,4,526353030,20,RL,93.0,11160,Pave,,Reg,Lvl,...,0,,,,0,4,2010,WD,Normal,244000
4,5,527105010,60,RL,74.0,13830,Pave,,IR1,Lvl,...,0,,MnPrv,,0,3,2010,WD,Normal,189900


## One Categorical Feature

Let's develop some intuition about the predictions that a regression model will make when there is a single categorical feature. First, suppose we train a linear regression model to predict house price from the neighborhood the house is in.

In [2]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LinearRegression

X = df_housing[["Neighborhood"]] # need 2D array for sklearn
y = df_housing["SalePrice"]

enc = OneHotEncoder()
X_dummies = enc.fit_transform(X)

model = LinearRegression()
model.fit(X_dummies, y)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

A regression model with just a single feature, **Neighborhood**, will predict the same price for all houses in the same neighborhood. What is that predicted value? We can obtain it by applying the `OneHotEncoder` to a list of the unique neighborhoods in the data set and passing this to `model.predict()`.

One place to find the unique neighborhoods is the encoder itself, which stores them in the attribute `.categories_`. We convert this list to a one-column `DataFrame`, since scikit-learn expects 2D input.

In [3]:
X_test = pd.Series(enc.categories_[0], name="Neighborhood").to_frame()
X_test

,Neighborhood
0,Blmngtn
1,Blueste
2,BrDale
3,BrkSide
4,ClearCr
5,CollgCr
6,Crawfor
7,Edwards
8,Gilbert
9,Greens


In [4]:
model.predict(enc.transform(X_test))

array([196661.67865072, 143590.00007301, 105608.33341373, 124756.2501338 ,
       208662.09099546, 201803.43407003, 207550.83507981, 130843.38193788,
       190646.57601699, 193531.25007232, 280000.00007041, 103752.90334449,
       137000.00007012,  95756.48656997, 162226.63171993, 145097.34981454,
       140710.86964306, 188406.90856323, 330319.12686238, 322018.265324  ,
       123991.89003842, 135071.93758878, 136751.15252867, 184070.18415638,
       229707.3245364 , 324229.19616865, 246599.54176915, 248314.58341154])

It is a bit hard to tell which prediction corresponds to which neighborhood. Let's put these numbers into a `Series`, indexed by the neighborhood.

In [5]:
pd.Series(
    model.predict(enc.transform(X_test)),
    index=X_test["Neighborhood"]
)

Neighborhood
Blmngtn    196661.678651
Blueste    143590.000073
BrDale     105608.333414
BrkSide    124756.250134
ClearCr    208662.090995
CollgCr    201803.434070
Crawfor    207550.835080
Edwards    130843.381938
Gilbert    190646.576017
Greens     193531.250072
GrnHill    280000.000070
IDOTRR     103752.903344
Landmrk    137000.000070
MeadowV     95756.486570
Mitchel    162226.631720
NAmes      145097.349815
NPkVill    140710.869643
NWAmes     188406.908563
NoRidge    330319.126862
NridgHt    322018.265324
OldTown    123991.890038
SWISU      135071.937589
Sawyer     136751.152529
SawyerW    184070.184156
Somerst    229707.324536
StoneBr    324229.196169
Timber     246599.541769
Veenker    248314.583412
dtype: float64

Could we have obtained these predictions some other way, without going through the trouble of fitting a linear regression model? Intuitively, if all we knew about a house was the neighborhood it was in, we would predict the average price of houses in that neighborhood.

In [6]:
df_housing.groupby("Neighborhood")["SalePrice"].mean()

Neighborhood
Blmngtn    196661.678571
Blueste    143590.000000
BrDale     105608.333333
BrkSide    124756.250000
ClearCr    208662.090909
CollgCr    201803.434457
Crawfor    207550.834951
Edwards    130843.381443
Gilbert    190646.575758
Greens     193531.250000
GrnHill    280000.000000
IDOTRR     103752.903226
Landmrk    137000.000000
MeadowV     95756.486486
Mitchel    162226.631579
NAmes      145097.349887
NPkVill    140710.869565
NWAmes     188406.908397
NoRidge    330319.126761
NridgHt    322018.265060
OldTown    123991.891213
SWISU      135071.937500
Sawyer     136751.152318
SawyerW    184070.184000
Somerst    229707.324176
StoneBr    324229.196078
Timber     246599.541667
Veenker    248314.583333
Name: SalePrice, dtype: float64

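These numbers match the predictions from our linear regression model, up to small numerical error in the fitting procedure (the values shown agree to within a fraction of a cent). Linear regression simply predicts the average price in each neighborhood.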

To see this mathematically, recall that linear regression minimizes the total squared distance between the observed price and the predicted price:

$$ \text{sum of } (\text{price} - \widehat{\text{price}})^2. $$

After we expand the **Neighborhood** column into 28 dummy variables (e.g., $I\{ \text{Blmngtn} \}$, $I\{ \text{Blueste} \}$, etc.), one for each neighborhood, we can write the predicted price in the linear regression model as 

$$ \widehat{\text{price}} = c_1 I\{ \text{Blmngtn} \} + c_2 I\{ \text{Blueste} \} + \ldots + c_{28} I\{ \text{Veenker} \}. $$

(For simplicity, we have omitted the intercept term $b$.)

Now, consider a house in Bloomington Heights, for which $I\{ \text{Blmngtn} \} = 1$ and all of the other dummy variables $I\{ \text{Blueste} \} = \ldots = I\{ \text{Veenker} \} = 0$. Then, $\widehat{\text{price}}$ for a house in Bloomington Heights is $c_1$. Likewise, $\widehat{\text{price}}$ for a house in Bluestem is $c_2$. And so forth.

Now, we can reframe linear regression as learning the values $c_1, c_2, \ldots, c_{28}$ that minimize

$$ \text{sum of } (\text{price} - \widehat{\text{price}})^2 = \underbrace{\text{sum of } (\text{price} - c_1)^2}_{\text{over houses in Blmngtn}} + \underbrace{\text{sum of } (\text{price} - c_2)^2}_{\text{over houses in Blueste}} + \ldots + \underbrace{\text{sum of } (\text{price} - c_{28})^2}_{\text{over houses in Veenker}}. $$ 

We saw in Chapter 3 that the value of $c$ that minimizes $\text{sum of } (\text{price} - c)^2$ is the mean of the prices. Each of the sums above involves only one of the coefficients, so each can be minimized separately: $\hat c_1$ will be the average price of houses in Bloomington Heights, $\hat c_2$ the average price of houses in Bluestem, and so on. Since $\hat c_1, \hat c_2, \ldots, \hat c_{28}$ are also the predicted values for each neighborhood, this shows that linear regression will predict the average label in each category when there is only one categorical variable in the model.
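The argument above can be checked numerically. The following is a minimal sketch on a made-up toy data set (the column names `hood` and `price`, the values, and the `_toy` variable names are assumptions for illustration). It fits a linear regression on the dummy variables and compares the predictions to the group means. Note that scikit-learn fits an intercept by default; the predictions should still equal the group means.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: three neighborhoods with a few prices each.
toy = pd.DataFrame({
    "hood": ["A", "A", "B", "B", "B", "C"],
    "price": [100.0, 140.0, 200.0, 230.0, 260.0, 90.0],
})

# One-hot encode the single categorical feature (as a dense array here).
enc_toy = OneHotEncoder()
X_toy = enc_toy.fit_transform(toy[["hood"]]).toarray()
model_toy = LinearRegression().fit(X_toy, toy["price"])

# Predict once for each category, in the encoder's (sorted) order.
cats = pd.DataFrame({"hood": enc_toy.categories_[0]})
preds = model_toy.predict(enc_toy.transform(cats).toarray())

# Group means, also sorted by category.
means = toy.groupby("hood")["price"].mean()

# Compare up to floating-point tolerance rather than exact equality.
print(np.allclose(preds, means.to_numpy()))
```

Comparing with `np.allclose` instead of `==` allows for the kind of small numerical error seen in the Ames output above.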

Exercise 1 in this lesson asks you to investigate what $k$-nearest neighbors regression does in the same situation.

## Mixing Quantitative and Categorical Features

In general, we want to fit machine learning models that use a mix of both categorical and quantitative features. In this situation, we will want to apply the `OneHotEncoder` to only the categorical features. Scikit-learn provides a `ColumnTransformer` that allows us to selectively apply transformations to certain columns.

For example, suppose we want to fit a $k$-nearest neighbors model to predict house price from quantitative features (square footage, number of bedrooms, number of full bathrooms) and categorical features (neighborhood, building type). We can use a `ColumnTransformer` to standardize the quantitative features and one-hot encode the categorical features.

In [7]:
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

ct = make_column_transformer(
    (StandardScaler(), ["Gr Liv Area", "Bedroom AbvGr", "Full Bath"]),
    (OneHotEncoder(), ["Neighborhood", "Bldg Type"]),
    remainder="drop"  # all other columns in X will be dropped.
)
ct

ColumnTransformer(n_jobs=None, remainder='drop', sparse_threshold=0.3,
                  transformer_weights=None,
                  transformers=[('standardscaler',
                                 StandardScaler(copy=True, with_mean=True,
                                                with_std=True),
                                 ['Gr Liv Area', 'Bedroom AbvGr', 'Full Bath']),
                                ('onehotencoder',
                                 OneHotEncoder(categories='auto', drop=None,
                                               dtype=<class 'numpy.float64'>,
                                               handle_unknown='error',
                                               sparse=True),
                                 ['Neighborhood', 'Bldg Type'])],
                  verbose=False)
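To make the effect of a `ColumnTransformer` concrete, here is a minimal sketch on a made-up three-row data set (the column names `area`, `hood`, and `ignored`, and the name `ct_toy`, are illustrative assumptions). It shows that the output contains one standardized column plus one dummy column per category, and that the unlisted column is dropped.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Hypothetical toy data: one quantitative, one categorical, one unused column.
toy = pd.DataFrame({
    "area": [1000.0, 1500.0, 2000.0],
    "hood": ["A", "B", "A"],
    "ignored": [1, 2, 3],
})

ct_toy = make_column_transformer(
    (StandardScaler(), ["area"]),   # standardize the quantitative column
    (OneHotEncoder(), ["hood"]),    # one-hot encode the categorical column
    remainder="drop",               # "ignored" is not passed through
)

out = ct_toy.fit_transform(toy)
# The result may be sparse or dense depending on its density; make it dense.
if hasattr(out, "toarray"):
    out = out.toarray()

print(out.shape)  # one standardized column plus two dummy columns
print(out)
```

The columns of the output follow the order in which the transformers were listed: the standardized features first, then the dummy variables.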

Next, we integrate this `ColumnTransformer` into a pipeline (refer to the previous lesson) with the `KNeighborsRegressor` model.

In [8]:
from sklearn.pipeline import make_pipeline
from sklearn.neighbors import KNeighborsRegressor

pipeline = make_pipeline(
    ct,
    KNeighborsRegressor(n_neighbors=10)
)

pipeline.fit(X=df_housing[["Gr Liv Area", "Bedroom AbvGr", "Full Bath",
                           "Neighborhood", "Bldg Type"]], 
             y=df_housing["SalePrice"])

Pipeline(memory=None,
         steps=[('columntransformer',
                 ColumnTransformer(n_jobs=None, remainder='drop',
                                   sparse_threshold=0.3,
                                   transformer_weights=None,
                                   transformers=[('standardscaler',
                                                  StandardScaler(copy=True,
                                                                 with_mean=True,
                                                                 with_std=True),
                                                  ['Gr Liv Area',
                                                   'Bedroom AbvGr',
                                                   'Full Bath']),
                                                 ('onehotencoder',
                                                  OneHotEncoder(categories='auto',
                                                                drop=None,
                         

Now, if we wanted to use this model to predict the price of a 3BR/2BA, 1700 sqft single-family house in Bloomington Heights, we could create a `Series` with this information, and call `pipeline.predict()` on a 2D-array with this single row.

In [9]:
x_test = pd.Series(dtype=object)  # explicit dtype avoids the empty-Series deprecation warning
x_test["Gr Liv Area"] = 1700
x_test["Bedroom AbvGr"] = 3
x_test["Full Bath"] = 2
x_test["Neighborhood"] = "Blmngtn"
x_test["Bldg Type"] = "1Fam"

pipeline.predict(X=pd.DataFrame([x_test]))

array([251550.])

So this house is predicted to cost \$251,550.

## Exercises

1\. Using the Ames data set, build a $10$-nearest neighbors model to predict house price using **Neighborhood** as the only feature. How do the predictions compare with just using the mean house price of each neighborhood? If there are any discrepancies, can you explain why?

2\. In the example from the lesson, we standardized the quantitative features and one-hot encoded the categorical features in parallel. This means that the dummy variables were not standardized before being passed into the $10$-nearest neighbors model. How would you modify the pipeline so that *all* of the variables are standardized?

(_Hint:_ You may find the `remainder="passthrough"` option of `ColumnTransformer` helpful.)

3\. Using the tips data set (`tips.csv`), use a $5$-nearest neighbors model to predict how much a male diner will tip on a Sunday bill of \$40.00.