# Feature Engineering

#### Categorical, Text, and Image Features
* Data scientists regularly work with categorical, text, and image data. However, to execute machine learning algorithms on these data types, it's necessary to perform transformations first. 
* Categorical data, such as the neighborhood in which a property is located, does not always work well with the machine learning algorithm you're most interested in using. 
* Linear regression, for example, requires numerical inputs.
* Options include one-hot encoding for categorical data and dedicated feature engineering for text and image data. Text features are important for NLP, which has applications in social media analysis and data mining.
* Feature engineering with images can be very complex. The simplest approach is to use the pixel values themselves as features.
* A more involved option is HOG (Histogram of Oriented Gradients).
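
As a rough illustration of the two image options above, the sketch below builds raw pixel features and a much simplified, HOG-like orientation histogram using only NumPy. The function names and the 8x8 test image are hypothetical, and the histogram is computed over the whole image at once. A full HOG descriptor also divides the image into cells and normalizes over blocks, which is omitted here.

```python
import numpy as np


def pixel_features(image):
    """Flatten a 2-D grayscale image into a 1-D vector of raw pixel values."""
    return np.asarray(image, dtype=float).ravel()


def orientation_histogram(image, n_bins=9):
    """Simplified HOG-like descriptor (assumption: no cells or blocks).

    Builds one magnitude-weighted histogram of unsigned gradient
    orientations over the whole image and normalizes it to sum to 1.
    """
    image = np.asarray(image, dtype=float)
    gy, gx = np.gradient(image)  # gradients along rows, then columns
    magnitude = np.hypot(gx, gy)
    # Unsigned orientation in degrees, folded into [0, 180)
    angle = np.degrees(np.arctan2(gy, gx)) % 180.0
    hist, _ = np.histogram(
        angle, bins=n_bins, range=(0.0, 180.0), weights=magnitude
    )
    total = hist.sum()
    return hist / total if total > 0 else hist


# Hypothetical 8x8 grayscale image containing a single vertical edge
image = np.zeros((8, 8))
image[:, 4:] = 1.0

raw = pixel_features(image)             # 64 raw pixel values
hog_like = orientation_histogram(image)  # 9 orientation bins
```

Raw pixels keep every value but are sensitive to small shifts in the image, whereas the orientation histogram summarizes edge directions in a handful of numbers.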

#### Handling Categorical Data
* Creating dummy features from categorical values lets you include them in your model. A single categorical column is converted into many binary columns, each indicating the presence or absence of one categorical level (one-hot encoding).
    * **The Category Encoders package (`category_encoders`, a scikit-learn-contrib project):**
    * **TL;DR:**
        * Use Category Encoders to improve model performance when you have nominal or ordinal data that may provide value.
        * For nominal columns, try OneHot, Hashing, LeaveOneOut, and Target encoding. Avoid OneHot for high-cardinality columns and for decision-tree-based algorithms.
        * For ordinal columns, try Ordinal (Integer), Binary, OneHot, LeaveOneOut, and Target. Helmert, Sum, BackwardDifference, and Polynomial are less likely to be helpful, but if you have time or a theoretical reason you might want to try them.
        * For regression tasks, Target and LeaveOneOut probably won’t work well.
        
    * Arguably, classifying each feature as one of seven data types helps you build better models faster:
        * **Useless** — useless for machine learning algorithms, that is — discrete
        * **Nominal** — groups without order — discrete; groups do not overlap
        * **Binary** — either/or — discrete
        * **Ordinal** — groups with order — discrete; natural, ordered categories
        * **Count** — the number of occurrences — discrete
        * **Time** — cyclical numbers with a temporal component — continuous
        * **Interval** — positive and/or negative numbers without a temporal component — continuous
    * **Nominal Data:**
        * Has values that cannot be ordered in any meaningful way
        * Most often one-hot (dummy) encoded
    * **Ordinal Data:**
        * Ordinal data can be rank-ordered in a meaningful way
        * Can be encoded in one of three ways:
            * 1) It can be assumed to be close enough to interval data, with relatively equal spacing between the values, to be treated as such.
            * 2) It can be treated as nominal data, where each category has no numeric relationship to another. You can try one-hot encoding and other encodings appropriate for nominal data.
            * 3) The magnitude of the difference between the values can be ignored: train your model with several different encodings and keep whichever works best.
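
The nominal and ordinal options above can be sketched with plain pandas. This is a minimal illustration rather than the Category Encoders API: the toy DataFrame, its column names, and the assumed ordering `poor < fair < good` are all hypothetical.

```python
import pandas as pd

# Hypothetical toy data: one nominal column and one ordinal column
df = pd.DataFrame({
    "neighborhood": ["Ballard", "Fremont", "Ballard", "Queen Anne"],
    "condition": ["poor", "good", "fair", "good"],
})

# Nominal data: one-hot (dummy) encoding, one binary column per level
one_hot = pd.get_dummies(df["neighborhood"], prefix="neighborhood", dtype=int)

# Ordinal option 1: treat as interval-like by mapping to ordered integer codes.
# The order below is an assumption made for this example.
condition_order = ["poor", "fair", "good"]
condition_type = pd.CategoricalDtype(categories=condition_order, ordered=True)
df["condition_code"] = df["condition"].astype(condition_type).cat.codes

# Ordinal option 2: treat as nominal and one-hot encode instead
condition_one_hot = pd.get_dummies(df["condition"], prefix="condition", dtype=int)

# Combine the encoded columns into a single numeric feature table
encoded = pd.concat([df[["condition_code"]], one_hot], axis=1)
```

Under option 3, you would fit the same model once per candidate encoding and compare validation scores to decide which to keep.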