```python
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import mglearn
from IPython.display import display
%matplotlib inline
```

# Chapter 4. Representing Data and Engineering Features

So far, we’ve assumed that our data comes in as a two-dimensional array of floating-point numbers, where each column is a [continuous feature](https://www.mathsisfun.com/data/data-discrete-continuous.html) that describes the data points.  
For many applications, this is not how the data is collected.  
A particularly common type of feature is the *categorical feature*.  
Also known as [discrete features](https://www.mathsisfun.com/data/data-discrete-continuous.html), these are usually not numeric.  
The distinction between categorical features and continuous features is analogous to the distinction between classification and regression, only on the input side rather than the output side.  
Examples of continuous features that we have seen are pixel brightnesses and size measurements of plant flowers.  
Examples of categorical features are the brand of a product, the color of a product, or the department (books, clothing, hardware) it is sold in.  
These are all properties that can describe a product, but they don’t vary in a continuous way.  
A product belongs either in the clothing department or in the books department.  
There is no middle ground between books and clothing, and no natural order for the different categories (books is not greater or less than clothing, hardware is not between books and clothing, etc.).
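
As a rough illustration of this distinction, the sketch below builds a small product table with one continuous column and one categorical column. The table, its column names, and its values are made up for this example. The department column is stored as an unordered pandas categorical, which records that the values come from a fixed set with no ranking among them.

```python
import pandas as pd

# Hypothetical product table; column names and values are invented for illustration.
products = pd.DataFrame({
    "price": [12.99, 35.50, 8.25],                     # continuous feature
    "department": ["books", "clothing", "hardware"],   # categorical feature
})

# Store the department as an unordered categorical: a fixed set of labels
# with no notion of "greater than" or "between".
products["department"] = pd.Categorical(products["department"], ordered=False)

is_ordered = products["department"].cat.ordered
categories = list(products["department"].cat.categories)
price_dtype = products["price"].dtype

print(products.dtypes)
print("ordered:", is_ordered)
print("categories:", categories)
```

The price column keeps a floating-point dtype, while the department column reports a `category` dtype whose labels carry no order.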

Regardless of the types of features your data consists of, how you represent them can have an enormous effect on the performance of machine learning models.  
We saw in Chapters 2 and 3 that scaling of the data is important.  
In other words, if you don’t rescale your data (say, to unit variance), then it makes a difference whether you represent a measurement in centimeters or inches.  
We also saw in Chapter 2 that it can be helpful to *augment* your data with additional features, like adding interactions (products) of features or more general polynomials.
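
Both points in this paragraph can be sketched with plain NumPy. The measurements and the small feature matrix below are made-up values, and `standardize` is a helper defined only for this example. The first part shows that the same lengths expressed in centimeters and in inches differ as raw numbers but coincide after rescaling to zero mean and unit variance. The second part augments a two-feature matrix with the product of the features and their squares.

```python
import numpy as np

# Hypothetical length measurements (invented values), in centimeters and inches.
length_cm = np.array([1.4, 4.7, 5.1, 6.0])
length_in = length_cm / 2.54

def standardize(x):
    """Rescale a 1-D array to zero mean and unit variance (illustrative helper)."""
    return (x - x.mean()) / x.std()

# The raw values depend on the unit; the standardized values do not.
raw_differ = not np.allclose(length_cm, length_in)
scaled_match = np.allclose(standardize(length_cm), standardize(length_in))
print("raw values differ:", raw_differ)
print("standardized values match:", scaled_match)

# Hypothetical feature matrix with two continuous features per data point.
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])

# Augment with the interaction (product) of the two features and their squares.
interaction = X[:, 0] * X[:, 1]
X_augmented = np.column_stack([X, interaction, X ** 2])
print(X_augmented.shape)
```

The augmented matrix keeps the original two columns and appends three derived ones, so a linear model fit on it can pick up interaction and quadratic effects.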

The question of how best to represent your data for a particular application is known as *feature engineering*, and it is one of the main tasks of data scientists and machine learning practitioners trying to solve real-world problems.  
Representing your data in the right way can have a bigger influence on the performance of a supervised model than the exact parameters you choose.  
In this chapter, we will first go over the important and very common case of categorical features, and then give some examples of helpful transformations for specific combinations of features and models.

## Categorical Variables