### Categorical encoding

This means that categorical data must be encoded as numbers before we can use it to fit and evaluate a model. There are many ways to encode categorical variables for modeling, although the three most common are ordinal encoding, label encoding, and one-hot encoding. Each is covered below.

In [17]:
import pandas as pd
import sklearn 
from sklearn import preprocessing

In [10]:
file = '/home/asimbanskota/t81_577_data_science/weekly_materials/week8/data/airlines.csv'

In [14]:
df = pd.read_csv(file)
df.head(2)

| | IATA_CODE | AIRLINE |
|---|---|---|
| 0 | UA | United Air Lines Inc. |
| 1 | AA | American Airlines Inc. |


### Ordinal encoding

The easiest way to convert categorical values into numeric ones is to encode them as integers. For this we can use the `OrdinalEncoder` from sklearn, an estimator that transforms each categorical feature into one new feature of integers (0 to n_categories - 1). By default the integers follow the sorted order of the category values.

In [42]:
# The encoder expects 2-D input; ravel() flattens the (n, 1) result into a 1-D column
enc = preprocessing.OrdinalEncoder()
df['ord_encoder'] = enc.fit_transform(df[['IATA_CODE']]).ravel()

In [23]:
df.head()

| | IATA_CODE | AIRLINE | ord_encoder |
|---|---|---|---|
| 0 | UA | United Air Lines Inc. | 10.0 |
| 1 | AA | American Airlines Inc. | 0.0 |
| 2 | US | US Airways Inc. | 11.0 |
| 3 | F9 | Frontier Airlines Inc. | 5.0 |
| 4 | B6 | JetBlue Airways | 2.0 |


The problem with ordinal encoding is that it implicitly imposes an order on the categories: a learning algorithm may treat an airline coded 2 as having twice the value of one coded 1. In some cases such an order is helpful. For example, encoding the ratings low, medium, and high as 1, 2, and 3 preserves their natural ranking and can bring additional information to the learning problem.
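
As a minimal sketch of the ratings example, the order can be passed explicitly through the `categories` parameter of `OrdinalEncoder`, so that the codes follow the intended ranking rather than alphabetical order. The `ratings` DataFrame here is made up for illustration, and the resulting codes start at 0 (0, 1, 2) rather than 1.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical data, not part of the airlines dataset
ratings = pd.DataFrame({"rating": ["low", "high", "medium", "low"]})

# Pass the desired order explicitly: low < medium < high
rating_enc = OrdinalEncoder(categories=[["low", "medium", "high"]])
ratings["rating_code"] = rating_enc.fit_transform(ratings[["rating"]]).ravel()

print(ratings)
```

Without the `categories` argument the encoder would sort the values alphabetically (high, low, medium), which would not match the natural ranking.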

### Label encoding

Many learning algorithms require that the class labels (the target variable) are also encoded as integers. Although scikit-learn and other machine learning libraries can convert class labels to integers internally, it is good practice to do so ahead of training. We can convert the class labels to numeric values using `OrdinalEncoder` as above or, more conventionally for a target, the `LabelEncoder` class, which works on a 1-D array of labels.
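
A short sketch of `LabelEncoder` on a target column may help here. The `labels` list is a hypothetical target (flight status), not something taken from the airlines file.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical class labels for illustration
labels = ["delayed", "on_time", "cancelled", "on_time"]

le = LabelEncoder()
y = le.fit_transform(labels)      # integer codes, one per label

# classes_ holds the original labels in sorted order; the position is the code
print(le.classes_)
print(y)

# inverse_transform maps the integer codes back to the original labels
decoded = le.inverse_transform(y)
```

Keeping the fitted encoder around is useful because `inverse_transform` lets us turn model predictions back into readable labels.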

### One-hot encoding

One-hot encoding is appropriate for categories without any inherent order. In this method, each category is mapped to a vector of 1s and 0s, with a 1 marking the category that is present. The number of new columns equals the number of categories in the feature.

In [37]:
# Re-create enc as a OneHotEncoder (it was an OrdinalEncoder above)
enc = preprocessing.OneHotEncoder()
one_hot = enc.fit_transform(df[['IATA_CODE']]).toarray()
pd.DataFrame(one_hot, columns=enc.categories_[0]).head()

Unnamed: 0,AA,AS,B6,DL,EV,F9,HA,MQ,NK,OO,UA,US,VX,WN
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
1,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Categories that were not seen during fitting can be handled by setting `handle_unknown='ignore'` on the encoder. The one-hot encoded columns for such a value will then be all zeros. One-hot encoding can also be done in pandas in a single line using the `get_dummies` function.

In [40]:
pd.get_dummies(df['IATA_CODE']).head()

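| | AA | AS | B6 | DL | EV | F9 | HA | MQ | NK | OO | UA | US | VX | WN |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |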
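
Returning to the `handle_unknown` option of scikit-learn's `OneHotEncoder`, the sketch below shows what happens when the encoder meets a category at transform time that it did not see during fitting. The tiny `train` and `new` frames and the code `ZZ` are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training data and new data; "ZZ" never appears in training
train = pd.DataFrame({"IATA_CODE": ["UA", "AA", "US"]})
new = pd.DataFrame({"IATA_CODE": ["AA", "ZZ"]})

# With handle_unknown="ignore", unseen categories do not raise an error
ohe = OneHotEncoder(handle_unknown="ignore")
ohe.fit(train)

encoded = ohe.transform(new).toarray()
print(ohe.categories_[0])   # learned categories in sorted order
print(encoded)              # the "ZZ" row is all zeros
```

With the default setting (`handle_unknown='error'`) the same `transform` call would raise an error instead.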


One-hot encoding has two possible limitations, depending on the dataset. First, a feature with many category levels produces a dataset with a large number of columns. For example, if you need to include US zip codes as a feature, you will end up with more than 40,000 zip code features. Second, it does not incorporate any information about possible interdependence among categories, such as Saturday and Sunday being weekend days and the other days being weekdays.
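
Both limitations can be seen in a small sketch. The `days` series is made up for illustration: the number of columns equals the number of distinct categories, and every pair of distinct categories ends up equally far apart, so Saturday is no closer to Sunday than it is to Friday.

```python
import numpy as np
import pandas as pd

# Hypothetical day-of-week feature
days = pd.Series(["Fri", "Sat", "Sun"], name="day")
one_hot_days = pd.get_dummies(days).astype(float)

# One column per distinct category
n_columns = one_hot_days.shape[1]

# Euclidean distance between the one-hot vectors of two days
vecs = one_hot_days.to_numpy()
d_fri_sat = np.linalg.norm(vecs[0] - vecs[1])
d_sat_sun = np.linalg.norm(vecs[1] - vecs[2])

print(n_columns, d_fri_sat, d_sat_sun)
```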

### Other encoders

Several other encoding techniques try to address the disadvantages of one-hot encoding, either by compressing the features into a manageable size (binary encoding, hash encoding, etc.) or by incorporating the interdependence of categories and their relationship to the target variable (mean (target) encoding, entity embeddings, etc.). The `category-encoders` package provides a set of scikit-learn-style transformers for these techniques and many others, except entity embeddings. The latter is a more recent development in which a neural network model is used to generate an efficient vector (embedding) representation of categorical features. The dimensionality of the vector can be reduced, and the interdependence between category levels with respect to the target variable is preserved.
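
To give a feel for mean (target) encoding, here is a deliberately naive sketch using only pandas rather than the `category-encoders` package. The `df_demo` frame and its `delayed` target are hypothetical. Each category is replaced by the mean of the target over the rows in that category, which yields a single numeric column no matter how many categories there are.

```python
import pandas as pd

# Hypothetical data: airline code plus a binary target (1 = flight delayed)
df_demo = pd.DataFrame({
    "IATA_CODE": ["UA", "AA", "UA", "AA", "US"],
    "delayed":   [1,    0,    0,    0,    1],
})

# Mean of the target per category
category_means = df_demo.groupby("IATA_CODE")["delayed"].mean()

# Replace each category with its target mean
df_demo["IATA_mean_enc"] = df_demo["IATA_CODE"].map(category_means)

print(df_demo)
```

In practice the means should be computed on training data only, usually with some smoothing or cross-validation, because this naive version leaks the target into the feature.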
