# Feature Preprocessing and Generation
### Categorical and Ordinal features

## Summary

* Values in ordinal features are sorted in some meaningful order.
* Label encoding maps categories to numbers.
* Frequency encoding maps categories to their frequencies.
* Label and frequency encodings are often used for tree-based models.
* One-hot encoding is often used for non-tree-based models.
* Interactions of categorical features can help linear models and KNN.

## Label Encoding
![nontree-tree](img/nontree-tree.png)

#### **Alphabetical (sorted)**
  * `sklearn.preprocessing.LabelEncoder`
  * [S,C,Q] -> [2, 0, 1] (sorted order is C < Q < S; codes are zero-based)
  
#### **Order of appearance**
  * `pandas.factorize`
  * [S,C,Q] -> [0, 1, 2] (codes are zero-based)
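
The two label-encoding orders can be compared on a toy column. This is a minimal sketch: the `embarked` series is made up for illustration, and both libraries return zero-based integer codes.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy column for illustration (not the real Titanic data)
embarked = pd.Series(["S", "C", "Q", "S"])

# LabelEncoder sorts the unique values (C < Q < S) and numbers them 0, 1, 2
sorted_codes = LabelEncoder().fit_transform(embarked)

# pandas.factorize numbers values in order of first appearance (S, C, Q)
appearance_codes, uniques = pd.factorize(embarked)

print(sorted_codes.tolist())      # [2, 0, 1, 2]
print(appearance_codes.tolist())  # [0, 1, 2, 0]
```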
  
#### **Frequency encoding**
  * [S,C,Q] -> [0.5, 0.3, 0.2]
```python
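# Count rows per 'Embarked' category (assumes `titanic` is a pandas DataFrame)
encoding = titanic.groupby('Embarked').size()
# Convert counts to relative frequencies
encoding = encoding / len(titanic)
# Replace each category with its frequency
titanic['enc'] = titanic.Embarked.map(encoding)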
```

### Can `frequency encoding` be of help for non-tree based models?
> "Yes. For example, if the frequency of a category is correlated with the target value, a linear model will utilize this dependency."


### One more thing about `frequency encoding` ...
> "If you have multiple categories with the `same frequency`, they won't be distinguishable in this new feature. We might apply a rank transformation here in order to deal with such ties."
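
One possible way to break such ties is to rank the frequencies so that every category gets a distinct value. The sketch below is an assumption about how this could be done, not the source's own code: the toy data is made up, and `method="first"` is just one tie-breaking choice that orders tied categories arbitrarily.

```python
import pandas as pd

# Toy column for illustration: "C" and "Q" both appear twice
embarked = pd.Series(["S", "S", "S", "C", "C", "Q", "Q"])

# Plain frequency encoding: C and Q share the same value (2/7)
freq = embarked.value_counts(normalize=True)

# Ranking with method="first" gives tied frequencies distinct ranks
rank = freq.rank(method="first")

# Map each row to the rank of its category
enc = embarked.map(rank)
print(enc.nunique())  # 3 distinct values, one per category
```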

## One-hot encoding

![onehotencoding](img/one-hot-encoding.png)

* Works well with linear models, KNN, and neural nets.
* Already `scaled`, since the minimum of each one-hot feature is zero and the maximum is one.
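
A minimal sketch of one-hot encoding with `pandas.get_dummies`, assuming a toy `Embarked` column (the data and the `Embarked` prefix are illustrative only):

```python
import pandas as pd

# Toy data for illustration
df = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

# One binary column per category; dtype=int gives 0/1 instead of booleans
one_hot = pd.get_dummies(df["Embarked"], prefix="Embarked", dtype=int)

print(one_hot.columns.tolist())  # ['Embarked_C', 'Embarked_Q', 'Embarked_S']
```

Each row has exactly one non-zero entry, so every column already lies in the range [0, 1].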

If a feature has too many categories, one-hot encoding creates one binary column per category:
* `We will add too many new binary columns, each with only a few non-zero values.`

### Sparse matrices!
* Instead of allocating space in RAM for every element of an array, we can store only the non-zero values and save a lot of memory.
* Often useful when working with categorical features or text data.
* Most popular libraries can work with these sparse matrices directly: `XGBoost`, `LightGBM`, `sklearn`.
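
The memory saving can be illustrated with `scipy.sparse`. This is a rough sketch under assumed sizes (1000 rows, 500 categories, random category codes), not a benchmark:

```python
import numpy as np
from scipy import sparse

# Hypothetical sizes chosen for illustration
n_rows, n_categories = 1000, 500
rng = np.random.default_rng(0)
codes = rng.integers(0, n_categories, size=n_rows)  # one category per row

# One-hot matrix in CSR format: a single 1.0 per row at the category's column
one_hot = sparse.csr_matrix(
    (np.ones(n_rows), (np.arange(n_rows), codes)),
    shape=(n_rows, n_categories),
)

# Compare the dense footprint with what the sparse format actually stores
dense_bytes = one_hot.toarray().nbytes
sparse_bytes = one_hot.data.nbytes + one_hot.indices.nbytes + one_hot.indptr.nbytes
print(dense_bytes, sparse_bytes)
```

The sparse version stores only the 1000 non-zero values plus their index arrays, while the dense array allocates all 500,000 cells.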


## Feature generation
* Feature interactions among several categorical features.
* Useful for non-tree-based models (linear, KNN, neural nets).

### Concatenation with binary columns
![feature-concat](img/feature-concat.png)

* Linear models can now find an optimal coefficient for every interaction, which can improve their predictions.
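
A minimal sketch of this concatenation approach, assuming two categorical columns named `pclass` and `sex` (the column names and toy values are illustrative, not taken from the figure):

```python
import pandas as pd

# Toy data for illustration
df = pd.DataFrame({
    "pclass": [3, 1, 2, 3],
    "sex": ["male", "female", "female", "female"],
})

# Concatenate the two categorical features into one interaction feature
df["pclass_sex"] = df["pclass"].astype(str) + "_" + df["sex"]

# One-hot encode the interaction: one binary column per observed combination
interactions = pd.get_dummies(df["pclass_sex"], dtype=int)

print(interactions.columns.tolist())
# ['1_female', '2_female', '3_female', '3_male']
```

Only combinations that actually occur in the data get a column, so unseen pairs at prediction time need separate handling.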
