## Feature engineering - categorical features

While a classification target vector may contain its categories as strings, categorical features need to be converted to numeric form before _scikit-learn_ estimators can use them.

There are two major types of categorical features:
1. Ordinal: use `OrdinalEncoder`
2. Nominal: use `OneHotEncoder` (it can one-hot encode non-string categories such as integer codes, which `DictVectorizer()` cannot)

**Question:** What is the difference between ordinal and nominal features? Can you give examples?

**Answer:** ...

In [None]:
import pandas as pd

In [None]:
data = pd.DataFrame([
    {'price': 850000, 'rooms': 4, 'neighborhood': 'Queen Anne'},
    {'price': 700000, 'rooms': 3, 'neighborhood': 'Fremont'},
    {'price': 650000, 'rooms': 3, 'neighborhood': 'Wallingford'},
    {'price': 600000, 'rooms': 2, 'neighborhood': 'Fremont'}
])

In [None]:
data

### OrdinalEncoder
**IMPORTANT:** Neighborhoods are _nominal_ and should not be encoded with an `OrdinalEncoder`. We do this here only for comparison with `OneHotEncoder`.

In [None]:
from sklearn.preprocessing import OrdinalEncoder
enc = OrdinalEncoder()
enc.fit(data[['neighborhood']])

In [None]:
enc.categories_

In [None]:
enc.transform(data[['neighborhood']])

In [None]:
enc.inverse_transform([[1]])
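
To show where `OrdinalEncoder` does fit, here is a minimal sketch on a genuinely ordinal feature. The `size` column and its values are hypothetical and not part of the housing data above. By default the encoder sorts categories alphabetically, so the intended order has to be passed through the `categories` parameter.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ordinal feature, made up for illustration only
sizes = pd.DataFrame({'size': ['small', 'large', 'medium', 'small']})

# Default behaviour: categories are sorted alphabetically
# (large, medium, small), which does not match the natural order
default_enc = OrdinalEncoder()
default_codes = default_enc.fit_transform(sizes)

# Passing the order explicitly makes the codes follow small < medium < large
ordered_enc = OrdinalEncoder(categories=[['small', 'medium', 'large']])
ordered_codes = ordered_enc.fit_transform(sizes)

print(default_enc.categories_)
print(default_codes.ravel())
print(ordered_codes.ravel())
```

With the explicit order, larger codes mean larger sizes. No such ordering exists for neighborhoods, which is why the encoding above is shown only for comparison.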

### OneHotEncoder 

In [None]:
from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder(sparse_output=False)
enc.fit(data[['neighborhood']])

In [None]:
enc.get_feature_names_out()

In [None]:
enc.transform(data[['neighborhood']])

In [None]:
data_enc = pd.DataFrame(enc.transform(data[['neighborhood']]), columns=enc.get_feature_names_out())
data_enc

**Question:** Can you add `data_enc` to the original `data`, dropping the `'neighborhood'` column?

In [None]:
data.head()

In [None]:
# Answer here


In practice, we would use `ColumnTransformer` in a pipeline. We will see this later.

## One-hot encoding (cont'd)

We have already seen how one-hot encoding transforms a categorical feature column into multiple output columns containing 0s and 1s.

In [None]:
# create a DataFrame with an integer feature and a categorical string feature
demo_df = pd.DataFrame({'Integer Feature': [0, 1, 2, 1],
                        'Categorical Feature': ['socks', 'fox', 'socks', 'box']})
display(demo_df)

In [None]:
from sklearn.preprocessing import OneHotEncoder
# Setting sparse_output=False means OneHotEncoder will return a numpy array, not a sparse matrix
ohe = OneHotEncoder(sparse_output=False)
print(ohe.fit_transform(demo_df))

In [None]:
print(ohe.get_feature_names_out())

### One-hot encoding a large dataset

As an example, we will use the Adult dataset of incomes in the United States, derived from the 1994 census database. The task is to predict whether a worker has an income of over \$50,000 or under \$50,000. The features in this dataset include each worker's age, how they are employed (self-employed, private industry employee, government employee, etc.), their education, their gender, their working hours per week, their occupation, and more.

In [None]:
import mglearn
import os
# The file has no headers naming the columns, so we pass header=None
# and provide the column names explicitly in "names"
adult_path = os.path.join(mglearn.datasets.DATA_PATH, "adult.data")
data = pd.read_csv(
    adult_path, 
    header=None, 
    index_col=False,
    skipinitialspace=True,  # skip the space after each comma
    names=['age', 'workclass', 'fnlwgt', 'education',  'education-num',
           'marital-status', 'occupation', 'relationship', 'race', 'gender',
           'capital-gain', 'capital-loss', 'hours-per-week', 'native-country',
           'income'])
# For illustration purposes, we only select some of the columns
data = data[['age', 'workclass', 'education', 'gender', 'hours-per-week',
             'occupation', 'income']]
# IPython.display allows nice output formatting within the Jupyter notebook
display(data.head())

In this dataset, __age__ and __hours-per-week__ are continuous features, which we know how to treat. The __workclass__, __education__, __gender__, and __occupation__ features, however, are categorical. Each of them takes its values from a fixed list rather than a range, and denotes a qualitative property rather than a quantity.

In [None]:
data.info()

In [None]:
ohe = OneHotEncoder(sparse_output=False)
print(ohe.fit_transform(data['education'].values.reshape(-1,1)).shape)

One column goes in and 16 columns come out, meaning that the education column has 16 distinct values.

In [None]:
print(ohe.get_feature_names_out(input_features=['education']))