# Working with categorical features and missing data

## Data

We will use the `penguins` data again. This time, we will use it for regression: predicting body mass from the other characteristics. We will also show how to work with categorical data.

In [None]:
import seaborn as sns

In [None]:
penguins = sns.load_dataset('penguins')

In [None]:
penguins.head(10)

In [None]:
penguins.shape

### Drop rows with missing outcomes

Once rows with a missing outcome are dropped, the numeric features contain no missing values. For illustration, we will randomly set some values of `bill_length_mm` to missing.

In [None]:
features = [
    'species',
    'island',
    'bill_length_mm',
    'bill_depth_mm',
    'flipper_length_mm',
    'sex'
]

In [None]:
target = 'body_mass_g'

In [None]:
penguins = penguins.dropna(subset=[target])

In [None]:
import numpy as np

In [None]:
# sample 10 distinct index labels; .loc is label-based and dropna left gaps
idx = np.random.choice(penguins.index, 10, replace=False)

In [None]:
penguins.loc[idx, 'bill_length_mm'] = None

In [None]:
penguins.shape
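Because `dropna` leaves gaps in the index labels and `.loc` selects rows by label, it is safest to sample labels from the index rather than row positions. The following is a minimal sketch on a made-up toy frame (the `toy` values and `n_missing` are illustrative assumptions, not the penguins data); it also shows one way to count missing values per column afterwards.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the penguins data (values are made up).
toy = pd.DataFrame(
    {
        "bill_length_mm": [39.1, 39.5, 40.3, 36.7, 39.3, 38.9],
        "body_mass_g": [3750.0, 3800.0, np.nan, 3450.0, 3650.0, 3625.0],
    }
)

# Dropping the row with a missing outcome removes label 2 from the index.
toy = toy.dropna(subset=["body_mass_g"])

# Sample index labels (not positions) without replacement, so every chosen
# label exists and exactly n_missing distinct rows are blanked out.
rng = np.random.default_rng(0)
n_missing = 2
idx = rng.choice(toy.index, size=n_missing, replace=False)
toy.loc[idx, "bill_length_mm"] = np.nan

# Count missing values per column to confirm the result.
missing_per_column = toy.isna().sum()
print(missing_per_column)
```

Sampling without replacement guarantees that exactly `n_missing` rows are affected; with replacement, duplicates could leave fewer rows blank than intended.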

### Convert strings to category

In [None]:
cols = penguins.select_dtypes('object').columns.tolist()

# catboost expects missing values in categorical columns to be strings, not NaN
penguins[cols] = penguins[cols].fillna('None')

# convert string columns to type 'category'
penguins[cols] = penguins[cols].astype('category')
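To see what this conversion does, here is a small sketch on a made-up frame (the `toy` data and the explicit `cat_cols` list are assumptions for illustration). Missing strings become the literal category `'None'`, while missing numeric values are left untouched.

```python
import numpy as np
import pandas as pd

# Made-up data with one missing string and one missing number.
toy = pd.DataFrame(
    {
        "sex": ["Male", "Female", None, "Female"],
        "island": ["Biscoe", "Dream", "Dream", "Torgersen"],
        "bill_depth_mm": [18.7, 17.4, 18.0, np.nan],
    }
)

# Columns to treat as categorical (listed explicitly in this sketch).
cat_cols = ["sex", "island"]

# Replace missing strings with the placeholder string 'None' ...
toy[cat_cols] = toy[cat_cols].fillna("None")

# ... then convert the string columns to the 'category' dtype.
toy[cat_cols] = toy[cat_cols].astype("category")

sex_categories = list(toy["sex"].cat.categories)
print(toy.dtypes)
print(sex_categories)
```

The placeholder shows up as an ordinary category, so any later encoding step treats "missing" as just another level of the feature.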

## Option 1: Use `catboost`

This is probably the simplest option since it handles categorical data, missing numeric values, and differing column scales without extra preprocessing.

In [None]:
X = penguins[features]
y = penguins[target]

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [None]:
from catboost import CatBoostRegressor

In [None]:
cols

In [None]:
cbr = CatBoostRegressor(cat_features=cols, verbose=False)

In [None]:
cbr.fit(X_train, y_train)
cbr.score(X_test, y_test)
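For scikit-learn regressors, `score` returns the coefficient of determination R² on the given data, and the CatBoost documentation describes `CatBoostRegressor.score` the same way. The sketch below checks this on synthetic data, using `LinearRegression` as a stand-in model (the data and the model choice are assumptions for illustration only).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic regression data: a linear signal plus noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 2))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=100)

# Fit on the first 80 rows, evaluate on the last 20.
model = LinearRegression().fit(X_demo[:80], y_demo[:80])
y_true = y_demo[80:]
y_pred = model.predict(X_demo[80:])

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
manual_r2 = 1.0 - np.sum((y_true - y_pred) ** 2) / np.sum((y_true - y_true.mean()) ** 2)
model_score = model.score(X_demo[80:], y_true)

print(model_score, manual_r2, r2_score(y_true, y_pred))
```

A value of 1 means perfect predictions, and 0 means the model does no better than always predicting the mean of the test outcomes.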

## Option 2: Use a `scikit-learn` pipeline

- Convert categorical data to one-hot vectors, i.e., a column with 3 categories is converted into 3 columns, each of which is either 0 or 1.
- Scale numerical data.
- Impute missing data. There are 3 options: `SimpleImputer`, `IterativeImputer`, and `KNNImputer`.

In [None]:
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import make_column_transformer, make_column_selector
from sklearn.pipeline import make_pipeline
from sklearn.experimental import enable_iterative_imputer  # must precede the IterativeImputer import
from sklearn.impute import SimpleImputer, KNNImputer, IterativeImputer
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

In [None]:
preprocessor = make_column_transformer(
    (
        OneHotEncoder(sparse_output=False, handle_unknown="ignore"),
        make_column_selector(dtype_include="category"),
    ),
    (
        StandardScaler(),
        make_column_selector(dtype_include="number"),
    )
)

### Pipeline version 1

In [None]:
pipe = make_pipeline(
    preprocessor,
    IterativeImputer(),
    LinearRegression(),
)

In [None]:
pipe.fit(X_train, y_train)
pipe.score(X_test, y_test)

### Pipeline version 2

In [None]:
pipe = make_pipeline(
    preprocessor,
    SimpleImputer(),
    RandomForestRegressor(),
)

In [None]:
pipe.fit(X_train, y_train)
pipe.score(X_test, y_test)