# A Demo and Comparison of Categorical Methods (Pt. 2)

**Please refer to [part 1](https://github.com/FeatureLabs/categorical-encoding/blob/master/notebooks/categorical-encoding-demo.ipynb) first for a more comprehensive explanation.**

As in part 1, we follow [this categorical encoding guide](https://github.com/FeatureLabs/categorical-encoding/blob/master/notebooks/categorical-encoding-guide.ipynb) and compare different categorical encoding methods on a dataset.

For this notebook, we use [this Kaggle dataset on predicting restaurant visitors](https://www.kaggle.com/c/recruit-restaurant-visitor-forecasting/data) to compare categorical encoding approaches. The goal is to predict the number of visitors a restaurant will receive on a given date, based on information from the reservation websites, the date and location, and details about the restaurant itself.

In [1]:
import featuretools as ft
import utils2
ft.__version__

'0.9.1'

## Load EntitySet

The data comes from two distinct reservation websites, so it is initially stored in separate datasets. For a more detailed explanation of the data preparation, see the `utils2.py` file.

In [2]:
es = utils2.load_entityset('./data/')
es

FileNotFoundError: [Errno 2] File b'../data/air_reserve.csv' does not exist: b'../data/air_reserve.csv'
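
The error above indicates that the Kaggle CSV files were not found at the expected location. A minimal sketch of a pre-flight check is shown below. The helper name `missing_files` is hypothetical, and the only file name used is the one that appears in the error message; the full list of files that `utils2.load_entityset` reads is an assumption you would need to confirm in `utils2.py`.

```python
from pathlib import Path


def missing_files(data_dir, required):
    """Return the names in `required` that are not present in `data_dir`."""
    base = Path(data_dir)
    return [name for name in required if not (base / name).is_file()]


# Only 'air_reserve.csv' is known from the error message above;
# extend this list after checking utils2.py (assumption).
REQUIRED = ["air_reserve.csv"]

missing = missing_files("./data/", REQUIRED)
if missing:
    print("Download these files from Kaggle before loading:", missing)
else:
    print("All listed files were found.")
```

Running a check like this before `utils2.load_entityset` makes a wrong data path easier to diagnose than a traceback from inside the loader.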

## Visualize Data

The `date_info` entity is not connected to the rest of the entityset, but we include it to help visualize and understand the data.

In [None]:
es.plot()

## Automated Feature Engineering

We apply Featuretools' Deep Feature Synthesis in order to generate our features.

In [None]:
feature_matrix, feature_defs = ft.dfs(entityset=es,
                                      target_entity="visit_data",
                                      verbose=True,
                                      drop_contains=['visit_data.visitors'])
feature_defs[:10]

In [None]:
es['store_info'].df['area_name'].describe()

This column is a high-cardinality feature: it has 214 unique categorical values, well above the limit of 15 we set in [our guide](https://github.com/FeatureLabs/categorical-encoding/blob/master/notebooks/categorical-encoding-guide.ipynb).

As a result, encoding methods such as one-hot encoding may run into problems, because the resulting matrix would be very high-dimensional.
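
To make the dimensionality concern concrete, here is a small sketch using made-up data rather than the restaurant dataset. The helper `cardinality_report` and the toy `area_name` values are hypothetical; the limit of 15 is the threshold mentioned above. One-hot encoding adds one column per unique value, so the width of the encoded matrix grows with the cardinality of the column.

```python
import pandas as pd


def cardinality_report(df, column, limit=15):
    """Count unique values in `column` and flag it if the count exceeds `limit`."""
    n_unique = df[column].nunique()
    return {"n_unique": n_unique, "high_cardinality": n_unique > limit}


# Toy stand-in for a categorical column: 100 rows, 20 distinct values (made up).
toy = pd.DataFrame({"area_name": [f"area_{i % 20}" for i in range(100)]})

report = cardinality_report(toy, "area_name")

# One-hot encoding produces one output column per distinct category.
one_hot = pd.get_dummies(toy["area_name"])

print(report)
print(one_hot.shape)
```

With 20 distinct values the encoded matrix has 20 columns; a column with 214 distinct values would, by the same logic, produce 214 columns from a single original feature.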

In [None]:
bayesian_results = utils2.bayesian_encoder_results(feature_matrix)

In [None]:
bayesian_results

In [None]:
classic_results = utils2.classic_encoder_results(feature_matrix)

In [None]:
classic_results