# Dealing with Categorical Features

In [None]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
import statsmodels.api as sm
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder

plt.style.use('seaborn-v0_8-notebook')  # named 'seaborn-notebook' before matplotlib 3.6

One issue we'd like to resolve is what to do with categorical features, i.e. predictors that represent categories rather than continua. In a Pandas DataFrame, these columns may well have strings or even other objects for values, but they need not. Sometimes integers are used to encode different categories, even when those categories have no natural ordering.

## Dummying - Theory

One very effective way of dealing with categorical variables is to dummy them out. What this involves is making a new column for _each categorical value in the column we're dummying out_.

These new columns will be filled only with 0's and 1's, a 1 representing the presence of the relevant categorical value.
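
As a quick sketch of the idea before turning to real data, here is dummying applied to a made-up `color` column (the data and names below are hypothetical, not part of the possum dataset):

```python
import pandas as pd

# Hypothetical toy data, not from the possum dataset
toy = pd.DataFrame({'color': ['red', 'green', 'red', 'blue']})

# One new column per distinct value. dtype=int gives 0's and 1's
# (recent pandas versions return booleans by default).
toy_dummies = pd.get_dummies(toy['color'], dtype=int)
print(toy_dummies)
```

Each row has exactly one 1, in the column that matches its original value.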

Let's look at a simple example. This is a dataset about Australian possums and you can find it on [Kaggle](https://www.kaggle.com/datasets/abrambeyer/openintro-possum).

In [None]:
possums = pd.read_csv('data/possum.csv')

In [None]:
possums.head()

## Problem Setup and EDA

Let's suppose we want to try to model possum age as a function of some of the other variables. Let's first check our data types:

In [None]:
possums.dtypes

The floats are all usable just as they are. The `Pop` and `sex` columns clearly need to be transformed. What about the integers? Let's see what these variables are like:

### `'case'`

The `case` variable just looks like an index that counts up from 1.

In [None]:
case_counts = possums['case'].value_counts()

fig, ax = plt.subplots()
ax.bar(case_counts.index, case_counts);

We'll plan to keep this variable out of our model!

### `'site'`

What about `site`?

In [None]:
site_counts = possums['site'].value_counts()

fig, ax = plt.subplots()
ax.bar(site_counts.index, site_counts);

Well, this looks more interesting. But notice that we don't have any reason to think that the numbers of these sites are meaningful *as numbers*. We're going to want to treat this variable in the same way that we'll treat `Pop` and `sex`, i.e. like a **categorical variable**.
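
One possible way to make that treatment explicit in pandas is to cast the integer column to the `category` dtype. This is only a sketch on hypothetical toy values, not a step the rest of this notebook depends on:

```python
import pandas as pd

# Hypothetical toy data with an integer-coded category
toy = pd.DataFrame({'site': [1, 2, 2, 7, 1],
                    'age': [2.0, 6.0, 3.0, 4.0, 5.0]})

# As an integer column, 'site' has a mean, which is meaningless for labels
print(toy['site'].mean())

# Cast to 'category' so pandas treats the values as labels
toy['site'] = toy['site'].astype('category')
print(toy['site'].cat.categories.tolist())

# Selecting numeric columns now leaves 'site' out
numeric_cols = toy.select_dtypes(include='number').columns.tolist()
print(numeric_cols)
```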

Before we go any further let's also check for null values:

In [None]:
possums.isna().sum()

That's only three missing values. Let's go ahead and drop those rows:

In [None]:
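# Drop Nulls
# (resetting the index is an assumption; it keeps row labels aligned
# with the encoded DataFrames built later)
possums_no_nulls = possums.dropna().reset_index(drop=True)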


Now what if we wanted to compare this `site` variable with our target `age`? EDA with categorical variables can look a bit different from EDA with continuous variables. Check out [this post](https://medium.com/analytics-vidhya/tutorial-exploratory-data-analysis-eda-with-categorical-variables-6a569a3aea55) from FIS's own Erin Hoffman, for example.

Taking a cue from Erin, we might try a histogram of age *for each value of our categorical `site` variable*:

In [None]:
fig, ax = plt.subplots()

for site in possums_no_nulls['site'].value_counts().index:
    ax.hist(possums_no_nulls[possums_no_nulls['site'] == site]['age'],
            alpha=0.4, label=f'site{site}')
plt.legend()
ax.set_title('Site vs. Age');

That's a little hard to see. Let's break this into two plots:

In [None]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

for site in possums_no_nulls['site'].value_counts().index[:4]:
    ax1.hist(possums_no_nulls[possums_no_nulls['site'] == site]['age'],
            alpha=0.4, label=f'site{site}')
ax1.legend()
ax1.set_title('Site vs. Age')

for site in possums_no_nulls['site'].value_counts().index[4:]:
    ax2.hist(possums_no_nulls[possums_no_nulls['site'] == site]['age'],
            alpha=0.4, label=f'site{site}')
ax2.legend()
ax2.set_title('Site vs. Age');

That's a bit more illuminating.
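
A numeric summary can complement the histograms. As a sketch on hypothetical toy values, `groupby` gives per-category statistics of the target:

```python
import pandas as pd

# Hypothetical toy data, not from the possum dataset
toy = pd.DataFrame({
    'site': [1, 1, 2, 2, 2, 7],
    'age':  [2.0, 4.0, 3.0, 5.0, 7.0, 6.0],
})

# Count, mean, and median of the target within each category
summary = toy.groupby('site')['age'].agg(['count', 'mean', 'median'])
print(summary)
```

The `count` column is worth watching: a category with very few rows gives an unreliable picture of its age distribution.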

### `sex` and `Pop`

What values of `Pop` do we have?

In [None]:
# Value Counts
possums_no_nulls['Pop'].value_counts()


Kaggle tells us that these are all Australian possums, each possum coming either from a population in Victoria (`Pop = 'Vic'`) or from a population either in New South Wales or in Queensland (`Pop = 'other'`).

Let's also see what this looks like when we compare it to `age`. This time we'll try a swarmplot from `seaborn`:

In [None]:
sns.swarmplot(x=possums_no_nulls['Pop'], y=possums_no_nulls['age']);

What about the `sex` variable?

In [None]:
sns.swarmplot(x=possums_no_nulls['sex'], y=possums_no_nulls['age']);

OK, good. Let's get to the dummy-coding.

## Dummying - Code

### `pandas.get_dummies()`

Let's use `pd.get_dummies()` on our variables:

In [None]:
# Get Dummies
pd.get_dummies(possums_no_nulls)


The last four columns show the action of the dummying-out. Notice that `get_dummies()` selects the object columns by default. If we want to dummy out the `site` variable as well, we'll need to ask for that explicitly:
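
For instance, on a hypothetical toy frame, the `columns` argument of `pd.get_dummies()` is one way to ask. The sketch below contrasts it with the default behavior:

```python
import pandas as pd

# Hypothetical toy data mixing an integer-coded and a string category
toy = pd.DataFrame({
    'site': [1, 2, 7],
    'sex': ['m', 'f', 'm'],
    'age': [2.0, 6.0, 3.0],
})

# Default: only the string column 'sex' is dummied; integer 'site' is untouched
default_dummies = pd.get_dummies(toy, dtype=int)
print(default_dummies.columns.tolist())

# Listing columns explicitly dummies the integer-coded 'site' as well
explicit_dummies = pd.get_dummies(toy, columns=['site', 'sex'], dtype=int)
print(explicit_dummies.columns.tolist())
```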

In [None]:
# Dummy up Site


Before we add these dummies to our data let's scale our numerical variables:
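
A minimal sketch of what this scaling does, using made-up numbers rather than the possum data: `StandardScaler` subtracts each column's mean and divides by its standard deviation. Its output is a NumPy array, so the sketch rebuilds a `DataFrame` with the original column names and index.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical values; only the column names echo the possum data
toy_nums = pd.DataFrame({'skullw': [50.0, 55.0, 60.0],
                         'taill': [34.0, 36.0, 41.0]})

ss = StandardScaler()
# fit_transform returns a NumPy array, so restore the labels by hand
toy_scaled = pd.DataFrame(ss.fit_transform(toy_nums),
                          columns=toy_nums.columns,
                          index=toy_nums.index)
print(toy_scaled)
```

Keeping the index matters later: `pd.concat(..., axis=1)` aligns rows by index label, not by position.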

In [None]:
# Create Numeric Dataframe


In [None]:
# Standard Scaler!


In [None]:
# Scaled, numeric dataframe


Now let's add the dummies to our `DataFrame`:

In [None]:
# object categoricals


In [None]:
# Concat it all together


## Digression: `sklearn.preprocessing.OneHotEncoder`

The `get_dummies()` function is useful for EDA, but when you're building machine learning models and pipelines in Phase 3, it will be important to do any one-hot encoding by using `sklearn`'s tool, the `OneHotEncoder`. The main advantage is that a fitted encoder stores the categories it saw in each column, so the same transformation can be applied to future data of the same form. This idea of transforming "future data of the same form" is central to the predictive statistical work we'll do in later phases. See [this page](https://stackoverflow.com/questions/36631163/pandas-get-dummies-vs-sklearns-onehotencoder-what-are-the-pros-and-cons) for more.
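
To see why that persistence matters, here is a sketch on hypothetical toy data. The encoder is fitted once and then applied to new rows, one of which contains a category it has never seen. The `handle_unknown='ignore'` setting is a choice made for this illustration; the default is to raise an error.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical "training" and "future" data
train = pd.DataFrame({'Pop': ['Vic', 'other', 'Vic']})
future = pd.DataFrame({'Pop': ['other', 'Vic', 'unseen']})

# handle_unknown='ignore' encodes categories not seen during fit as all zeros
ohe_demo = OneHotEncoder(handle_unknown='ignore')
ohe_demo.fit(train)

# The fitted encoder reuses the categories learned from `train`
print(ohe_demo.categories_)
future_encoded = ohe_demo.transform(future).toarray()
print(future_encoded)
```

The output always has the same two columns, whatever values show up in `future`. Calling `pd.get_dummies(future)` would instead produce a third column for `'unseen'`.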

Let's try using this tool now. We can compare and contrast its functionality with `get_dummies()`.

In [None]:
# OHE!


In [None]:
# Encoded


Notice that by default the `.transform()` method returns a **sparse matrix**. If we want to see the 1's and 0's we can either override this by setting `sparse_output=False` in the encoder instance (the argument was called `sparse` before scikit-learn 1.2) or we can call `todense()` on the sparse matrix:

In [None]:
# `sparse_output` replaced `sparse` in scikit-learn 1.2
ohe2 = OneHotEncoder(sparse_output=False)
ohe2.fit(possum_cats)
possums_encoded2 = ohe2.transform(possum_cats)
possums_encoded2

In [None]:
possums_encoded.todense()

We can also make a `DataFrame` and use the feature names saved in the fit-call as our column headers:

In [None]:
# Pandas, pandas everywhere


To cut down on **multicollinearity** among our predictors, in practice we'll not use *all* of the categories for a given variable but rather leave one out. Note that we can do this *without loss of any information*: Take the `sex` column above: If we remove the `x2_m` column we could reproduce it from the values of `x2_f`, since we know that non-female possums (`x2_f=0`) must be male (`x2_m=1`) and that female possums (`x2_f=1`) cannot be male (`x2_m=0`).
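
Both claims can be checked numerically. The sketch below uses a hypothetical toy `sex` column to show that the full set of dummies is perfectly collinear with an intercept column, and that the dropped column can be rebuilt from the one we keep:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({'sex': ['m', 'f', 'f', 'm', 'f']})
full = pd.get_dummies(toy['sex'], dtype=int)   # columns 'f' and 'm'

# The dummy columns always sum to 1, so together with an intercept column
# of 1's the design matrix is rank-deficient (perfect multicollinearity)
design = np.column_stack([np.ones(len(full)), full.to_numpy()])
print(np.linalg.matrix_rank(design))   # fewer than the 3 columns present

# Dropping one dummy loses nothing: 'm' is recoverable as 1 - 'f'
reconstructed_m = 1 - full['f']
print((reconstructed_m == full['m']).all())
```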

This streamlining is easily done with the `sklearn` tool:

In [None]:
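ohe3 = OneHotEncoder(drop='first')
ohe3.fit(possum_cats)
# get_feature_names() was removed in scikit-learn 1.2. When the encoder is
# fitted on a DataFrame, get_feature_names_out() labels columns by source
# column (e.g. `Pop_other`) where older versions printed `x1_other`.
possums_encoded = pd.DataFrame(ohe3.transform(possum_cats).toarray(),
                               columns=ohe3.get_feature_names_out())
possums_encoded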

In [None]:
X2 = pd.concat([X_nums_scaled, possums_encoded], axis=1)
X2

## `OrdinalEncoder`

Occasionally we want a coding for our categories that preserves some intuitive *ordering* of those categories.

Suppose we had the results of some survey as our dataset, where answers to questions are of the form:

- Strongly Disagree
- Disagree
- Neutral
- Agree
- Strongly Agree

In this case we'd be throwing away information if we used the one-hot strategy. So we might try an encoding like:

<table>
    <tr>
        <th>Category</th>
        <th>Code</th>
    </tr>
    <tr>
        <td>Strongly Disagree</td>
        <td>0</td>
    </tr>
    <tr>
        <td>Disagree</td>
        <td>1</td>
    </tr>
    <tr>
        <td>Neutral</td>
        <td>2</td>
    </tr>
    <tr>
        <td>Agree</td>
        <td>3</td>
    </tr>
    <tr>
        <td>Strongly Agree</td>
        <td>4</td>
    </tr>
</table>

To effect such a strategy we can use `sklearn.preprocessing.OrdinalEncoder`:

In [None]:
survey_results = (3 * ['Strongly Disagree'])
survey_results.extend(3 * ['Disagree'])
survey_results.extend(3 * ['Neutral'])
survey_results.extend(3 * ['Agree'])
survey_results.extend(3 * ['Strongly Agree'])

np.random.seed(42)
np.random.shuffle(survey_results)
survey_preds = pd.DataFrame(survey_results)
survey_preds

In [None]:
categories = [['Strongly Disagree', 'Disagree', 'Neutral', 'Agree', 'Strongly Agree']]
ords = OrdinalEncoder(categories=categories)
ords.fit(survey_preds)
ords.transform(survey_preds)
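
Passing `categories` explicitly matters because, left to itself, `OrdinalEncoder` orders string labels alphabetically. A short sketch on hypothetical answers shows the difference:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical survey answers
answers = pd.DataFrame({'answer': ['Agree', 'Strongly Disagree', 'Neutral',
                                   'Disagree', 'Strongly Agree']})

# Without `categories`, labels are sorted alphabetically,
# which scrambles the intended order
default_codes = OrdinalEncoder().fit_transform(answers).ravel()

order = [['Strongly Disagree', 'Disagree', 'Neutral', 'Agree', 'Strongly Agree']]
ordered_codes = OrdinalEncoder(categories=order).fit_transform(answers).ravel()

print(default_codes)
print(ordered_codes)
```

With the default, "Agree" gets the lowest code and "Strongly Disagree" the highest, so the codes no longer track agreement.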

## Modeling

Now let's throw our data into a linear regression model.

In [None]:
y = possums_no_nulls['age']

In [None]:
X2_with_const = sm.add_constant(X2)

In [None]:
sm.OLS(y, X2_with_const).fit().summary()

Notice how the model now includes parameters for our dummies! But here's a question: How do we **interpret** them?

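In the case of skull width (`skullw`), we have a beta of 0.2891. Because we standardized the numeric predictors, one unit of `skullw` is one standard deviation rather than 1 mm, so we can expect a possum's age to be 0.2891 years higher for each one-standard-deviation increase in skull width, holding the other predictors fixed.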

But take the beta for `x1_other`. The value there is -1.6976. How can we understand this? This value encodes the difference we can expect in our target (age, here) when we *increase the variable by one unit*. But for this variable, "increasing it by one unit" means going from `x1_other=0` to `x1_other=1`, and *that* means going from a possum from the Victoria population to a possum from either the New South Wales or the Queensland population. So when interpreting the coefficients of categorical variables in a linear regression model, it's critical to keep in mind that they must be interpreted against a **baseline**, which is where the values of the inputs are 0. Notice that, for the same reason, this also affects the interpretation of the intercept term.
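
The baseline reading can be checked in the simplest possible case: a regression on a single dummy and nothing else. This is a sketch on hypothetical ages, and it uses NumPy's least-squares solver in place of `statsmodels` purely to keep the example small. With other predictors in the model the coefficient is no longer a plain difference of group means.

```python
import numpy as np

# Hypothetical ages for two groups; is_other = 0 is the baseline group
age = np.array([4.0, 5.0, 6.0, 2.0, 3.0, 4.0])
is_other = np.array([0, 0, 0, 1, 1, 1])

# Design matrix: intercept column plus the single dummy
X_demo = np.column_stack([np.ones_like(age), is_other])
intercept, beta = np.linalg.lstsq(X_demo, age, rcond=None)[0]

baseline_mean = age[is_other == 0].mean()
other_mean = age[is_other == 1].mean()
print(intercept, baseline_mean)           # intercept matches the baseline mean
print(beta, other_mean - baseline_mean)   # dummy coefficient matches the gap
```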

For more on the interpretation of regression coefficients for categorical variables, see [Erin's repo](https://github.com/hoffm386/coefficients-of-dropped-categorical-variables).

## Exercise

Go back to the variable `X` that has *all* the categorical columns (NOT `X2`) and try building a regression model based on dropping *other* categories than what we just dropped. For example, what happens if we drop the "male" rather than the "female" column, or if we drop the "other population" column rather than the "Victoria population" column?

<details>
    <summary>One answer here</summary>
<code>preds = X.drop(['Pop_other', 'sex_m', 1], axis=1)
preds_const = sm.add_constant(preds)
sm.OLS(y, preds_const).fit().summary()</code>
</details>