# Feature Engineering

#### Categorical, Text, and Image Features
* Data scientists regularly work with categorical, text, and image data. However, to execute machine learning algorithms on these data types, it's necessary to perform transformations first. 
* Categorical data, such as the neighborhood in which a property is located, does not always work well with the machine learning algorithm you're most interested in using. 
* Linear regression, for example, requires numerical inputs.
* Options include one-hot encoding for categorical data and dedicated feature engineering for text and image data. Text features are important for natural language processing (NLP), which has applications in social media and data mining.
* Feature engineering with images can be very complex; the simplest approach is to use the pixel values themselves.
* HOG: Histogram of Oriented Gradients
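
As a minimal sketch of the simplest image approach mentioned above (raw pixel values as features), the snippet below flattens a tiny grayscale image into a feature vector. The 2x2 array and the scaling to the 0 to 1 range are assumptions made for illustration, not part of the original notes.

```python
import numpy as np

# Hypothetical 2x2 grayscale image with 8-bit intensities (0-255).
image = np.array([[0, 64],
                  [128, 255]], dtype=np.uint8)

# Flatten the 2D pixel grid into a 1D vector: one feature per pixel.
# Dividing by 255 rescales intensities to [0, 1], a common but optional step.
features = image.flatten().astype(np.float32) / 255.0

print(features.shape)  # one entry per pixel
```

Each image then becomes one row of a feature matrix, which is the tabular form most algorithms expect.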
   
Feature Engineering: understand how best to preprocess and engineer features from categorical, continuous, and unstructured data. 

* **Feature Engineering:** the act of taking raw data and extracting features for machine learning 
* Most machine learning algorithms work with tabular data.
* Most ML algorithms require their input data to be represented as a vector or a matrix, and many assume that the data is normally distributed.

* **Different Types of Data:**
    * **Continuous:** either integers (whole numbers) or floats (decimal values)
    * **Categorical:** one of a limited set of values, e.g. gender, country of birth
    * **Ordinal:** ranked values, often with no detail of distance between them
    * **Boolean:** True/False values
    * **Datetime:** dates and times
* In pandas, columns with the `object` dtype typically contain strings.
* Knowing the type of each column is useful when you want to analyze only a subset of specific data types. To do this, use the `.select_dtypes()` method and pass a list of the relevant data types: `only_ints = df.select_dtypes(include=['int'])`
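
The `.select_dtypes()` call above can be tried on a small frame. This is a sketch under assumptions: the DataFrame and its column names are made up for illustration, and which integer widths `'int'` matches can depend on the platform, so `'number'` is shown as a broader alternative.

```python
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame({
    "Age": [34, 28, 45],
    "Salary": [52000.0, 48500.5, 61000.0],
    "Country": ["USA", "India", "France"],
    "Employed": [True, False, True],
})

# Inspect the dtype pandas inferred for each column.
print(df.dtypes)

# Keep only integer columns, as in the notes above.
only_ints = df.select_dtypes(include=["int"])

# 'number' selects every numeric column (integers and floats, not booleans).
only_numeric = df.select_dtypes(include=["number"])

# Boolean columns can be selected the same way.
only_bools = df.select_dtypes(include=["bool"])

print(only_numeric.columns.tolist())
```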

#### Categorical Variables
* Categorical variables represent groups that are qualitative in nature, such as colors or country of birth.
* You will need to encode categorical values as numeric values to use them in your machine learning models 
* When categories are unordered (like colors or country of birth), assigning ordered numerical values to them may greatly reduce the effectiveness of your model.
* Thus, you cannot allocate arbitrary numbers to each category, as that would imply some form of ordering to the categories
* $\Rightarrow$ **One Hot Encoding**
* $\Rightarrow$ **Dummy Encoding**
    * The two are very similar and often confused.
    * By default, pandas performs one-hot encoding when you use the `get_dummies()` function.
    * The difference:
        * **One Hot Encoding:** converts *n* categories into *n* features
            * `pd.get_dummies(df, columns=['Country'], prefix ='C')`
            * note that specifying a prefix argument can improve readability, especially if the list of column names passed to `columns` contains more than one column.
            * **Use for: generally creating more explainable features**
            * **Note:** one-hot encoding may create features that are entirely collinear, because the same information is represented multiple times.
        * **Dummy Encoding:** creates *n* - 1 features for *n* categories
            * `pd.get_dummies(df, columns=['Country'], drop_first=True, prefix='C')`
            * the dropped column (referred to as the *base column*) is encoded by the absence of all other features, and its value is represented by the intercept
            * **Use for: Necessary information without duplication.**
            
        * Both one-hot encoding and dummy encoding may result in a **huge** number of columns being created if there are too many different categories in a column 
        * In these cases, you may only want to create columns for the most common values:
            * `counts = df['Country'].value_counts()` (counts the occurrences of each category value)
            * once you have the counts of category occurrences, you can use them to limit which values you include by first creating a mask of values that occur fewer than *n* times:
            * `mask = df['Country'].isin(counts[counts<5].index)`
            * use the mask to replace these categories that occur less frequently with a value of your choice (for example: an umbrella category like 'Other')
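            * `df.loc[mask, 'Country'] = 'Other'` (using `.loc` avoids the chained assignment `df['Country'][mask] = ...`, which may not modify `df`)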
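
The steps above can be sketched end to end: group rare categories under `'Other'`, then compare one-hot and dummy encoding. The toy data, the `Age` column, and the threshold of 2 (instead of the 5 used above) are assumptions chosen so the example stays small.

```python
import pandas as pd

# Hypothetical toy data; 'Country' follows the notes above, 'Age' is illustrative.
df = pd.DataFrame({
    "Country": ["USA", "India", "USA", "France", "India", "USA", "Peru"],
    "Age": [34, 28, 45, 51, 23, 39, 30],
})

# Count how often each category occurs.
counts = df["Country"].value_counts()

# Mask rows whose category occurs fewer than 2 times (threshold is an assumption).
mask = df["Country"].isin(counts[counts < 2].index)

# Replace the rare categories with an umbrella value.
df.loc[mask, "Country"] = "Other"

# One-hot encoding: n categories become n columns.
one_hot = pd.get_dummies(df, columns=["Country"], prefix="C")

# Dummy encoding: n categories become n - 1 columns; the first is the base column.
dummy = pd.get_dummies(df, columns=["Country"], prefix="C", drop_first=True)

print(one_hot.columns.tolist())
print(dummy.columns.tolist())
```

Here `France` and `Peru` each occur once, so both collapse into `Other`. The dummy-encoded frame has one column fewer than the one-hot frame because the base category is dropped.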