# Transforming Data and Dealing with Categorical Variables

### Encoding Categorical Variables, Normalizing Variables, Incorporating Interaction and Polynomial Terms, etc.


Today's focus is all about translating raw **data** into useful **information** that a model can understand and properly use. 

## Set Up

New dataset for today! Insurance costs

My source: https://www.kaggle.com/mirichoi0218/insurance (they got the idea for cleaning up the original open source data from [Machine Learning with R](https://www.packtpub.com/product/machine-learning-with-r-third-edition/9781788295864))

In [None]:
# Initial imports
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error

In [None]:
df = pd.read_csv('data/insurance.csv')

In [None]:
# explore the data
df.head()

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
# visualize relationships between numeric columns
sns.pairplot(df)
plt.show()

In [None]:
# visualize correlations between numeric columns
# (numeric_only skips the string columns; needs pandas >= 1.5)
sns.heatmap(df.corr(numeric_only=True), annot=True)
plt.show()

### Initial Model!

Let's run a kitchen sink model! Ignoring categorical columns, let's just throw all of our numeric columns into a model and see how we do.

In [None]:
# set our X and y
# ignore our categorical columns for now
used_cols = None
X = None
y = None

In [None]:
# train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.25, 
                                                    random_state=42)

In [None]:
# scale our data

# fit scaler on train data

# transform both train and test data


In [None]:
# now, let's model!
lr_base = None

# grab predictions for train and test set
train_preds = None
test_preds = None

In [None]:
# evaluate
print(f"Train R2 Score: {r2_score(y_train, train_preds):.3f}")
print(f"Test R2 Score: {r2_score(y_test, test_preds):.3f}")
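
For reference, here is one way the steps above could be filled in. This is only a sketch: it runs on a small synthetic stand-in for the insurance data (the values are made up, and names like `toy`, `num_cols` and `toy_scaler` are placeholders), and it assumes a `StandardScaler` for this first pass.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the insurance data (values are made up)
rng = np.random.default_rng(42)
n = 200
toy = pd.DataFrame({
    "age": rng.integers(18, 65, size=n),
    "bmi": rng.normal(30, 6, size=n),
    "children": rng.integers(0, 5, size=n),
    "smoker": rng.choice(["yes", "no"], size=n),
})
toy["charges"] = 250 * toy["age"] + 300 * toy["bmi"] + rng.normal(0, 2000, size=n)

# Numeric predictors only; the target stays out of X
num_cols = ["age", "bmi", "children"]
X_toy = toy[num_cols]
y_toy = toy["charges"]

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy,
                                          test_size=0.25,
                                          random_state=42)

# Fit the scaler on the training rows only, then apply it to both splits
toy_scaler = StandardScaler()
X_tr_scaled = toy_scaler.fit_transform(X_tr)
X_te_scaled = toy_scaler.transform(X_te)

# Fit the model on train, then predict on both splits
toy_lr = LinearRegression().fit(X_tr_scaled, y_tr)
toy_train_preds = toy_lr.predict(X_tr_scaled)
toy_test_preds = toy_lr.predict(X_te_scaled)

print(f"Train R2 Score: {r2_score(y_tr, toy_train_preds):.3f}")
print(f"Test R2 Score: {r2_score(y_te, toy_test_preds):.3f}")
```

The key habit is that the scaler only ever sees the training rows when it is fit, so nothing about the test set leaks into the preprocessing.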

In [None]:
# let's look at our residuals
# for our full model
plt.scatter(train_preds, y_train-train_preds, label='Train')
plt.scatter(test_preds, y_test-test_preds, label='Test')

plt.axhline(y=0, color = 'red', label = '0')
plt.xlabel('predictions')
plt.ylabel('residuals')
plt.legend()
plt.show()

### Evaluate: How'd we do? What do you notice?

- 


## Encoding Categorical Variables

How do we bring in those categorical columns? By **encoding** them - translating the string variables into useful numbers the model can hopefully understand and take meaning from.

### Most Common Encoding Method: One Hot Encoding (OHE)

One very effective way of dealing with categorical variables is to dummy them out, a process also known as One Hot Encoding. What this involves is making a new column for _each categorical value_ in the column we're dummying out.

These new columns are turned into binaries, with a 1 representing the presence of the relevant categorical value.


For an example in our data: we have a column called `region`:

In [None]:
df['region'].value_counts()

In [None]:
df['region'].head()

With OHE, the result will either be three or four new columns: `is_southeast`, `is_northwest`, `is_southwest`, `is_northeast`

For the head of this data:

| `is_southeast` | `is_northwest` | `is_southwest` | `is_northeast` |
| -------------- | -------------- | -------------- | -------------- | 
| 0 | 0 | 1 | 0 |
| 1 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 |
| 0 | 1 | 0 | 0 |
| 0 | 1 | 0 | 0 |

Why could this result in three columns instead of four? We often drop the first column and let the model capture that value by having zeros in all the other columns. The four columns always sum to 1, so any one of them is fully determined by the other three; dropping one removes that perfect **multicollinearity** between the newly created columns.
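
To see why one column is redundant, here is a small sketch on a made-up `regions` series (not the real data). It checks that the full set of dummies sums to 1 in every row, which makes it perfectly collinear with an intercept column.

```python
import numpy as np
import pandas as pd

# Made-up region values, just for illustration
regions = pd.Series(
    ["southwest", "southeast", "southeast", "northwest", "northwest", "northeast"],
    name="region",
)

full = pd.get_dummies(regions, dtype=int)                      # four columns
dropped = pd.get_dummies(regions, drop_first=True, dtype=int)  # three columns

# Every row of the full encoding sums to 1, duplicating the intercept column
print(full.sum(axis=1).tolist())

# Compare the number of independent columns with and without the drop
intercept = np.ones((len(regions), 1))
rank_full = np.linalg.matrix_rank(np.hstack([intercept, full.to_numpy()]))
rank_dropped = np.linalg.matrix_rank(np.hstack([intercept, dropped.to_numpy()]))
print(rank_full, "independent columns out of", full.shape[1] + 1)
print(rank_dropped, "independent columns out of", dropped.shape[1] + 1)
```

With all four dummies plus an intercept, one column carries no new information; after dropping one, every remaining column is independent.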

We'll explore two methods to one hot encode our features: a pandas method, which is easy but does not remember which categories it saw, so there is no guarantee train and test get the same columns, and a sklearn method, which learns the categories from train and can later transform test in the same way.
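
A quick sketch of that difference, using two tiny made-up frames (`train` and `test` here are placeholders, not our real split) where the test rows happen to contain only one region:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Made-up frames: the test rows contain only one of the four regions
train = pd.DataFrame({"region": ["southwest", "southeast", "northwest", "northeast"]})
test = pd.DataFrame({"region": ["southeast", "southeast"]})

# pandas encodes whatever categories it sees, so the column sets differ
train_dummies = pd.get_dummies(train, columns=["region"])
test_dummies = pd.get_dummies(test, columns=["region"])
print(train_dummies.shape, test_dummies.shape)

# the sklearn encoder remembers the categories it learned from train
ohe = OneHotEncoder(drop="first")
ohe.fit(train)
train_enc = ohe.transform(train).toarray()
test_enc = ohe.transform(test).toarray()
print(train_enc.shape, test_enc.shape)
```

The pandas results have different widths, so a model fit on one could not predict on the other, while the encoder output has the same columns for both.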

#### With Pandas' `get_dummies()`

In [None]:
# Let's look at the columns we have
df.columns

In [None]:
# Let's define a list of our categorical columns
cat_cols = None

In [None]:
# And a list of our full X columns
# Fun trick!
x_cols = [*used_cols, *cat_cols]

In [None]:
# one hot encode variables
df_ohe = pd.get_dummies(df[x_cols], # note that we run it on all X cols
                        columns=cat_cols, # but only encode cat cols
                        drop_first=True) # drop first for multicollinearity
print(df_ohe.shape)
df_ohe.head()

#### With `sklearn`'s One Hot Encoder

In [None]:
# Let's import two things from SKLearn that will make encoding these columns easy
from sklearn.preprocessing import OneHotEncoder

from sklearn.compose import ColumnTransformer

In [None]:
# Re-define our X value to include ALL columns
X = df[x_cols]
X.head()

In [None]:
# New train-test split for the new X
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.25, 
                                                    random_state=42)

In [None]:
# create an encoder object. This will help us to convert
# categorical variables to new columns
encoder = OneHotEncoder(handle_unknown='error',
                        drop='first',
                        categories='auto')

# Create a ColumnTransformer object.
# This will help us to merge transformed columns
# with the rest of the dataset.
ct = ColumnTransformer(transformers=[('ohe', encoder, cat_cols)],
                       remainder='passthrough')
ct.fit(X_train)
X_train_enc = ct.transform(X_train)
X_test_enc = ct.transform(X_test)

In [None]:
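# can display as a dataframe like so
# (scikit-learn >= 1.0; older versions use ct.get_feature_names(),
# which produces names like ohe__x1_yes)
pd.DataFrame(X_train_enc, columns=ct.get_feature_names_out()).head()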

In [None]:
# scale our data - now let's use a Min Max Scaler because binaries!
scaler = MinMaxScaler()

# train on train data
scaler.fit(X_train_enc)

# transform both train and test data
X_train_scaled = scaler.transform(X_train_enc)
X_test_scaled = scaler.transform(X_test_enc)

In [None]:
# now, let's model!
lr_ohe = None

# grab predictions for train and test set
train_preds = None
test_preds = None

In [None]:
# evaluate
print(f"Train R2 Score: {r2_score(y_train, train_preds):.3f}")
print(f"Test R2 Score: {r2_score(y_test, test_preds):.3f}")

In [None]:
# visualize residuals, for the model that now has cat cols
plt.scatter(train_preds, y_train-train_preds, label='Train')
plt.scatter(test_preds, y_test-test_preds, label='Test')

plt.axhline(y=0, color = 'red', label = '0')
plt.xlabel('predictions')
plt.ylabel('residuals')
plt.legend()
plt.show()

### Evaluate: Thoughts?

- 


#### Some Pros and Cons of OHE:

Pros:

- Simple to understand
- Easy to implement

Cons:

- If the categorical column has many options, or there are a lot of categorical columns, you can add _a lot_ more columns - **curse of dimensionality**
- Resulting columns are very sparse (mostly zeros)
- Resulting columns are directly related (multicollinear)

Also - how do we interpret these coefficients?

In [None]:
# Look at the coefs from our sklearn model


Notice how the model now includes parameters for our dummies! But here's a question: How do we **interpret** them?

In the case of `age`, we have a beta of 3643.065, which means we can expect the insurance charges to grow by 3643.065 if we increase the person's age by one unit. Because we min-max scaled our X variables, one unit here is not one year: it is the full range of `age`, from the minimum to the maximum seen in the training data.
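
Here is a sketch of what "one unit" means after min-max scaling, on made-up data (the `age` and `charges` arrays below are synthetic). The coefficient on the scaled column should equal the raw per-year coefficient times the range of the column.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import MinMaxScaler

# Synthetic ages and charges, made up for illustration
rng = np.random.default_rng(0)
age = rng.integers(18, 65, size=300).reshape(-1, 1).astype(float)
charges = 250 * age.ravel() + rng.normal(0, 1000, size=300)

# Same model, fit once on raw age and once on min-max scaled age
raw_model = LinearRegression().fit(age, charges)
scaled_age = MinMaxScaler().fit_transform(age)
scaled_model = LinearRegression().fit(scaled_age, charges)

age_range = age.max() - age.min()
print(f"coef per year of age:        {raw_model.coef_[0]:.2f}")
print(f"coef per unit of scaled age: {scaled_model.coef_[0]:.2f}")
print(f"per-year coef * age range:   {raw_model.coef_[0] * age_range:.2f}")
```

The last two printed numbers match, which is why a scaled coefficient reads as "youngest to oldest" rather than "one more year".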

But take the beta for `ohe__x1_yes` - where x1 means 'smoker'. The value there is 9546.251. How can we understand this? 

This value encodes the difference we can expect in our target (charges) when we *increase the variable by one unit*. But for this variable, "increasing it by one unit" means going from `ohe__x1_yes=0` to `ohe__x1_yes=1`, and *that* means going from a person who doesn't smoke to one who does! So it's critical always to keep in mind when interpreting the coefficients of categorical variables in a linear regression model that they must be interpreted against a **baseline**, which is where the values of the inputs are 0. Notice that, for the same reason, this also affects the interpretation of the intercept term.
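
A minimal sketch of that baseline idea, using six made-up rows and a single hypothetical `smoker_yes` dummy. With only that one predictor, the intercept is the mean of the baseline group and the coefficient is the gap between the two group means.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Six made-up rows: three non-smokers (baseline) and three smokers
toy = pd.DataFrame({
    "smoker_yes": [0, 0, 0, 1, 1, 1],
    "charges": [4000.0, 5000.0, 6000.0, 30000.0, 32000.0, 34000.0],
})

model = LinearRegression().fit(toy[["smoker_yes"]], toy["charges"])
group_means = toy.groupby("smoker_yes")["charges"].mean()

print(f"intercept (baseline, non-smokers): {model.intercept_:.1f}")
print(f"coefficient on smoker_yes:         {model.coef_[0]:.1f}")
print(f"difference in group means:         {group_means.loc[1] - group_means.loc[0]:.1f}")
```

Once other predictors are in the model the numbers no longer equal raw group means, but the reading is the same: the coefficient is the expected difference from the baseline category, holding everything else fixed.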

More practice/resources: https://github.com/hoffm386/coefficients-of-dropped-categorical-variables

### Other Encoding Methods?

Certainly there are other ways to turn a categorical column into numeric data that a model can understand.

Some Examples:

- [Label/Ordinal Encoding](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html)
- [Frequency Encoding](https://contrib.scikit-learn.org/category_encoders/count.html) (just a count encoder with `normalize=True` to turn into a frequency percentage)
- [Target Encoding](https://contrib.scikit-learn.org/category_encoders/targetencoder.html) (or, relatedly, [Leave-One-Out Encoding](https://contrib.scikit-learn.org/category_encoders/leaveoneout.html) or [Weight of Evidence Encoding](https://contrib.scikit-learn.org/category_encoders/woe.html))
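
As a rough sketch of the first two ideas on a made-up `region` column: ordinal encoding below uses sklearn's `OrdinalEncoder`, while the frequency encoding is done with plain pandas (`value_counts` plus `map`) instead of the Category Encoders library linked above, just to keep the example dependency-free.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Made-up column, just for illustration
toy = pd.DataFrame({"region": ["southwest", "southeast", "southeast", "northwest"]})

# Ordinal encoding: each category becomes an integer code (sorted alphabetically here)
ordinal = OrdinalEncoder()
toy["region_ordinal"] = ordinal.fit_transform(toy[["region"]]).ravel()

# Frequency encoding: each category becomes its share of the rows
freq = toy["region"].value_counts(normalize=True)
toy["region_freq"] = toy["region"].map(freq)

print(toy)
```

Keep in mind that ordinal codes imply an order (northwest < southeast < southwest) that regions don't really have, and that in a real workflow both encodings would be learned from the training data only.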

Useful links:

- [Category Encoders](https://contrib.scikit-learn.org/category_encoders/index.html) - library of sklearn-style encoders that implement more encoding methods than those actually packaged in Sklearn
- [Sklearn's Preprocessing Section](https://scikit-learn.org/stable/modules/preprocessing.html#preprocessing) - user guide section on preprocessing (includes scalers and transformers as well as encoders)

## Distribution Transformations - AKA Normalizing Variables

### Log Scaling

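Linear regression often works better when the predictors and target are roughly symmetric rather than heavily skewed. (Strictly speaking, the normality assumption is about the model's residuals, but a skewed target often leads to skewed residuals.)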

**Log-scaling** can be a good tool to make *right-skewed* data more normal.

(For *left-skewed* data, which is rarer, we can try transforming our data by raising it to an exponent greater than 1.)

Suppose e.g. a kde plot of my predictor $X$ looks like this:

![original](images/skewplot.png)

In that case, the kde plot of a log-transformed version of $X$ could look like this:

![log](images/logplot.png)
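
The same effect can be checked numerically. This sketch draws a made-up right-skewed sample (log-normal, chosen purely for illustration) and compares the skewness before and after taking the log.

```python
import numpy as np
import pandas as pd

# Made-up right-skewed sample
rng = np.random.default_rng(0)
x = pd.Series(rng.lognormal(mean=9, sigma=1, size=1000))

skew_before = x.skew()
skew_after = np.log(x).skew()

print(f"skew before log: {skew_before:.2f}")
print(f"skew after log:  {skew_after:.2f}")
```

A skew near 0 means roughly symmetric. Note that `np.log` needs strictly positive values; for data containing zeros, `np.log1p` is a common alternative.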

Let's look at our `y` value - how is it distributed?

In [None]:
y.hist();

That's pretty skewed! Let's see what transforming it would look like.

Log transformation using numpy - [documentation](https://numpy.org/doc/stable/reference/generated/numpy.log.html)

In [None]:
np.log(y).hist();

Much more normal!

Let's log both our train and test y, then see if our model improves.

In [None]:
y_train_log = None
y_test_log = None

In [None]:
# now, let's model!
lr_log = None

# grab predictions for train and test set
train_preds = None
test_preds = None

In [None]:
# evaluate
print(f"Train R2 Score: {r2_score(y_train_log, train_preds):.3f}")
print(f"Test R2 Score: {r2_score(y_test_log, test_preds):.3f}")
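
For reference, a sketch of the log-target workflow on synthetic data (the arrays and names like `log_model` are made up; no scaling or encoding is included, to keep the focus on the target). It also shows how `np.exp` brings predictions back to the original scale.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic data where log(y) is linear in X (made up for illustration)
rng = np.random.default_rng(1)
X_toy = rng.uniform(0, 1, size=(400, 2))
y_toy = np.exp(8 + 1.5 * X_toy[:, 0] + 0.5 * X_toy[:, 1]
               + rng.normal(0, 0.2, size=400))

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy,
                                          test_size=0.25,
                                          random_state=42)

# Log both targets; nothing is learned from the data here, so no leakage
y_tr_log = np.log(y_tr)
y_te_log = np.log(y_te)

log_model = LinearRegression().fit(X_tr, y_tr_log)
log_test_preds = log_model.predict(X_te)

print(f"Test R2 on the log scale: {r2_score(y_te_log, log_test_preds):.3f}")
# np.exp undoes the log, putting predictions back in original units
print(f"Test R2 on the original scale: {r2_score(y_te, np.exp(log_test_preds)):.3f}")
```

R2 on the log scale and R2 on the original scale are not directly comparable, so compare models on the same scale.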

In [None]:
# visualize residuals, for the model of log-y
plt.scatter(train_preds, y_train_log-train_preds, label='Train')
plt.scatter(test_preds, y_test_log-test_preds, label='Test')

plt.axhline(y=0, color = 'red', label = '0')
plt.xlabel('predictions')
plt.ylabel('residuals')
plt.legend()
plt.show()

### Evaluate - Thoughts?

- 


### Interpreting after Log Transformations

But with this transformed target, how do I now interpret my LR coefficients?


In [None]:
# Look at coefs for our ohe model
dict(zip(ct.get_feature_names_out(), lr_ohe.coef_))  # older sklearn: ct.get_feature_names()

In [None]:
# Now for coefs of our log model
dict(zip(ct.get_feature_names_out(), lr_log.coef_))  # older sklearn: ct.get_feature_names()

Before the transformation, I would have said that a one-unit increase in `age` (and note: because we min-max scaled our X inputs, one unit spans the full range from the minimum to the maximum age in the training data!) results on average in a 3643.065 increase in `charges`.

But what I need to say now is that a one-unit increase in `age` results on average in a 0.486 increase *in the logarithm of charges*, i.e. charges get multiplied by a factor of $e^{0.486}$.

More practically, you can turn the coefficient into a percentage! Exponentiate the coefficient, subtract one, and multiply by 100 to get the percentage change in the target.

Formula:

$(e^{\text{coef}} - 1) \times 100$

In code:
```
(np.exp(coef) - 1) * 100
```
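
As a quick worked example, plugging in the 0.486 `age` coefficient quoted above (the second coefficient, 0.02, is made up to show the small-coefficient shortcut):

```python
import numpy as np

coef = 0.486  # the `age` coefficient quoted above

factor = np.exp(coef)            # multiplicative change in charges
pct_change = (factor - 1) * 100  # the same change as a percentage
print(f"factor: {factor:.3f}, percent change: {pct_change:.1f}%")

# For small coefficients, 100 * coef is already a close approximation
small = 0.02  # made-up coefficient
print(f"exact: {(np.exp(small) - 1) * 100:.2f}%, approx: {small * 100:.2f}%")
```

The shortcut of reading the coefficient directly as a percentage only holds when the coefficient is close to zero; for something as large as 0.486 the exact formula matters.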



In [None]:
# For example:
# (older sklearn versions: ct.get_feature_names())
log_coef_dict = dict(zip(ct.get_feature_names_out(), lr_log.coef_))

for feature, coef in log_coef_dict.items():
    pct = (np.exp(coef) - 1) * 100
    print(f"A one-unit increase in {feature} results on average in a {pct:.1f}% change in charges")

Note that our binary columns start to get really weird. In practice, before interpreting variables in any practical sense, we'd run one last model without scaling to allow us to better interpret our results - but we'll still likely keep our logged `y` as our target if it improves our model!

Reference:
- https://stats.oarc.ucla.edu/sas/faq/how-can-i-interpret-log-transformed-variables-in-terms-of-percent-change-in-linear-regression/