# Categorical Variables with Linear and Nonlinear Models

Three types of variables are used, particularly in linear regression.

1. Continuous - ordered and can be subdivided.
2. Categorical - limited and fixed number of values.
3. Ordinal - limited and fixed number of values for which order is important.

In R, a categorical variable is called a *factor* and its distinct values are called *levels*. In the statistics literature, the 0/1 columns that indicate membership in a category are often called [*dummy variables*](https://en.wikipedia.org/wiki/Dummy_variable_(statistics)). But see below.

The actual numbers for categorical variables do not matter (or do they...).

## Differences in Languages (Packages)

We have to encode variables for use in `scikit-learn` as the package only takes numeric categories. `R` takes "levels" and encodes internally. So you can pass strings as categories to models in `R`, but not in `scikit-learn`.

## Transforming Variables in Pandas

In [1]:
import numpy as np
import pandas as pd

from random import choice

In [12]:
df = pd.DataFrame(np.random.randn(25, 2), columns=['a', 'b'])
df['e'] = [choice(('Chicago', 'Boston', 'New York', "RALEIGH", "Raleigh", "NYC")) for i in range(df.shape[0])]
df.head(6)

| | a | b | e |
|---|---|---|---|
| 0 | -0.104917 | 0.556912 | Chicago |
| 1 | -0.221517 | -0.166883 | NYC |
| 2 | -0.378367 | 1.180719 | NYC |
| 3 | -0.795186 | 0.502131 | New York |
| 4 | -0.462202 | -0.66146 | Boston |
| 5 | 0.813214 | 1.007439 | Boston |


`scikit-learn` has the classes [`OneHotEncoder`](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html) and [`LabelEncoder`](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html) in `sklearn.preprocessing`. Older releases of `OneHotEncoder` accepted only integer-coded columns; since version 0.20 it also encodes string columns directly. `LabelEncoder` maps strings to integers, but its documentation describes it as intended for target labels rather than input features. There are some complications when using these classes, so read the documentation and examples.
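As a minimal sketch of `OneHotEncoder` on a string column, assuming scikit-learn 0.20 or later, we can encode a small hypothetical array of city names (the array `cities` is made up for illustration and is not the `df` used in this notebook):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical single-column input; scikit-learn expects a 2-D array.
cities = np.array([["Boston"], ["Chicago"], ["Boston"], ["Raleigh"]])

encoder = OneHotEncoder()
# fit_transform returns a sparse matrix by default, so convert it for display.
encoded = encoder.fit_transform(cities).toarray()

print(encoder.categories_[0])  # the categories found, in sorted order
print(encoded)                 # one column per category, one 1 per row
```

Each row has exactly one nonzero entry, in the column of its category.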

Here is another way with Pandas. Use it as a preprocessing step before using `scikit-learn` models.

In [13]:
df1 = pd.get_dummies(df, prefix=["e"])
df1.head(6)

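| | a | b | e_Boston | e_Chicago | e_NYC | e_New York | e_RALEIGH | e_Raleigh |
|---|---|---|---|---|---|---|---|---|
| 0 | -0.104917 | 0.556912 | 0 | 1 | 0 | 0 | 0 | 0 |
| 1 | -0.221517 | -0.166883 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | -0.378367 | 1.180719 | 0 | 0 | 1 | 0 | 0 | 0 |
| 3 | -0.795186 | 0.502131 | 0 | 0 | 0 | 1 | 0 | 0 |
| 4 | -0.462202 | -0.66146 | 1 | 0 | 0 | 0 | 0 | 0 |
| 5 | 0.813214 | 1.007439 | 1 | 0 | 0 | 0 | 0 | 0 |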


Pandas did what we asked, but unfortunately some of our categories were not encoded properly. We should have looked at the data first. There are upper/lower case problems ('RALEIGH' and 'Raleigh'), an embedded space ('New York'), and 'NYC' and 'New York' are encoded as separate categories even though they refer to the same city.

In [15]:
df.groupby(by="e").count()

| e | a | b |
|---|---|---|
| Boston | 4 | 4 |
| Chicago | 5 | 5 |
| NYC | 6 | 6 |
| New York | 4 | 4 |
| RALEIGH | 3 | 3 |
| Raleigh | 3 | 3 |


Clean it up, then use ``get_dummies`` again.
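One possible cleanup is sketched here on a small hypothetical frame that mirrors the messy values above. The alias table and the `e_clean` column name are assumptions for illustration; real data will likely need a longer mapping.

```python
import pandas as pd

# Hypothetical sample containing the same kinds of inconsistencies as column 'e'.
df = pd.DataFrame({"e": ["Chicago", "NYC", "New York", "RALEIGH", "Raleigh", "Boston"]})

# Assumed alias table: map alternate spellings onto one canonical (lowercase) label.
aliases = {"nyc": "new york"}

# Strip stray whitespace, lowercase everything, then collapse the aliases.
cleaned = df["e"].str.strip().str.lower().replace(aliases)
df["e_clean"] = cleaned.str.title()

dummies = pd.get_dummies(df["e_clean"], prefix="e")
print(sorted(dummies.columns))
```

After cleaning, the six raw spellings collapse into four dummy columns.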

Examine your categorical variables to avoid creating redundant features.

What should you do with NA in a categorical field? Create a new column which encodes whether the value is missing or not. This is particularly important for nonlinear models, such as Random Forests.
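Two ways to build such a missing-value column in Pandas are sketched below, on a hypothetical series with two missing city names. The name `e_missing` is an assumption for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical column with two missing city names.
cities = pd.Series(["Boston", np.nan, "Chicago", np.nan], name="e")

# Option 1: an explicit 0/1 indicator for missingness.
e_missing = cities.isna().astype(int)

# Option 2: ask get_dummies to append an extra column for NaN.
dummies = pd.get_dummies(cities, prefix="e", dummy_na=True)
print(dummies.columns.tolist())
```

With `dummy_na=True` the missing rows get their own column instead of being encoded as all zeros.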

## Categorical Variables and Linear Models

In models such as regression, which are linear in the unknown parameters, we can't include a dummy variable for every category. If we have four dummy variables for quarter, we must include only three of them.

- Quarter 1 = 1 or 0
- Quarter 2 = 1 or 0
- Quarter 3 = 1 or 0
- Quarter 4 = 1 or 0

Encoding zero for quarters 1 through 3 carries the same information as encoding a one for quarter 4 and zero for the rest. Remember, linear regression has an *intercept*; the four dummy variables always sum to one, which duplicates the intercept column, so we can't use all four.

The dummy variable that is omitted is the base category against which all others are compared.
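The redundancy can be checked numerically. The sketch below builds a design matrix from hypothetical quarterly data and compares its rank with and without the full set of dummies. Note that `drop_first=True` drops the first level, so quarter 1 becomes the base category here; that choice is an assumption of this example.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two years of quarterly observations.
quarter = pd.Series([1, 2, 3, 4, 1, 2, 3, 4], name="quarter")
intercept = np.ones((len(quarter), 1))

# Intercept plus all four dummies: 5 columns, but they are linearly dependent.
all_dummies = pd.get_dummies(quarter, prefix="q").astype(float)
X_full = np.hstack([intercept, all_dummies.to_numpy()])
rank_full = np.linalg.matrix_rank(X_full)

# Intercept plus three dummies (quarter 1 omitted): 4 independent columns.
reduced = pd.get_dummies(quarter, prefix="q", drop_first=True).astype(float)
X_reduced = np.hstack([intercept, reduced.to_numpy()])
rank_reduced = np.linalg.matrix_rank(X_reduced)

print(X_full.shape[1], rank_full)        # more columns than rank
print(X_reduced.shape[1], rank_reduced)  # columns equal rank
```

When the rank is smaller than the number of columns, the least-squares coefficients are not uniquely determined.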

For instance

$$\ln(\texttt{wage}) = \alpha + \beta \, \texttt{college} + \texttt{error}$$

Here we set up a regression problem with `college` coded as one or zero. If `college` is one, then the predicted log wage goes up by $\beta$ (assuming $\beta > 0$). So we can interpret the coefficient on `college` as measuring a wage premium/discount for attending college.
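To see how the dummy coefficient behaves, here is a minimal sketch that fits this model by least squares on simulated data. The sample size, the "true" values of $\alpha$ and $\beta$, and the noise level are all made up for illustration; nothing here is real wage data.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Simulated 0/1 dummy and log wages (all numbers are assumptions).
college = rng.integers(0, 2, size=n)
alpha_true, beta_true = 2.5, 0.4
log_wage = alpha_true + beta_true * college + rng.normal(0, 0.3, size=n)

# Design matrix: a column of ones (the intercept) and the dummy.
X = np.column_stack([np.ones(n), college])
(alpha_hat, beta_hat), *_ = np.linalg.lstsq(X, log_wage, rcond=None)

# With a single dummy regressor, the fitted slope equals the gap in group means.
gap = log_wage[college == 1].mean() - log_wage[college == 0].mean()
print(alpha_hat, beta_hat, gap)
```

The estimated intercept is the mean log wage of the `college = 0` group, and the estimated slope is the difference between the two group means.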

The `error` term contains

1. Every variable not included in the regression model.
2. Randomness.

This model is not *predictive* but is *explanatory*.

Questions

1. Why am I taking the natural log of `wage`?
2. What happens if $\beta < 0$?
3. We encoded `college` as zero or one. Why not one and two?

Including all dummy variables together with an intercept in a regression model introduces perfect [*multicollinearity*](https://en.wikipedia.org/wiki/Multicollinearity) and can cause all kinds of problems for predictions. It *may* be ok for an explanatory model, if appropriate corrections are made.

Logistic regression has the same problem.

Note you can remove the intercept, but this may introduce complications in your modeling. For instance, in ``R``, the intercept is included whenever you have a formula like ``y ~ x`` for use in ``lm``, so you **must** use ``y ~ x - 1``. It is easy to forget this.
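The same idea can be sketched in Python with plain NumPy: leave out the column of ones and keep all four dummies. The quarterly values in `y` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: three years of quarterly observations.
quarter = pd.Series([1, 2, 3, 4] * 3, name="quarter")
y = np.array([10.0, 12.0, 15.0, 11.0,
              11.0, 13.0, 14.0, 12.0,
              9.0, 14.0, 16.0, 10.0])

# All four dummies and no intercept column.
D = pd.get_dummies(quarter, prefix="q").to_numpy(dtype=float)
coef, *_ = np.linalg.lstsq(D, y, rcond=None)

# Without an intercept, each coefficient is simply that quarter's mean.
group_means = pd.Series(y).groupby(quarter).mean().to_numpy()
print(coef)
print(group_means)
```

In `scikit-learn`, `LinearRegression(fit_intercept=False)` plays the role of the `- 1` in the R formula; as in R, the intercept is included unless you ask otherwise.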

## Nonlinear Models

When you fit a nonlinear model, such as a Random Forest, you must include all categories. There is no concept of an intercept term.
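A minimal sketch of this with `scikit-learn`, assuming it is installed: the city labels and the target below are synthetic and purely illustrative. The point is only that `get_dummies` is called without `drop_first`, so every category keeps its own column.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic categorical feature (already cleaned).
cities = pd.Series(
    rng.choice(["Boston", "Chicago", "New York", "Raleigh"], size=200), name="e"
)
# Synthetic target: 1 for Chicago rows, 0 otherwise.
y = (cities == "Chicago").astype(int)

# Keep all four dummy columns; no category is dropped.
X = pd.get_dummies(cities, prefix="e")

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
accuracy = model.score(X, y)
print(X.columns.tolist(), accuracy)
```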