## Categorical Variables

So far, we’ve assumed that our data comes in as a two-dimensional array of floating-point numbers, where each column is a continuous feature that describes the data points. For many applications, this is not how the data is collected. A particularly common type of feature is the *categorical feature*. Also known as *discrete features*, these are usually not numeric.

The distinction between categorical features and continuous features is analogous to the distinction between classification and regression, only on the input side rather than the output side. Examples of continuous features that we have seen are pixel brightnesses and size measurements of plant flowers. Examples of categorical features are the brand of a product, the color of a product, or the department (books, clothing, hardware) it is sold in. These are all properties that can describe a product, but they don’t vary in a continuous way. A product belongs either in the clothing department or in the books department. There is no middle ground between books and clothing, and no natural order for the different categories (books is not greater or less than clothing, hardware is not between books and clothing, etc.).

As an example, we will use the dataset of adult incomes in the United States, derived from the 1994 census database. The task for the adult dataset is to predict whether a worker has an income of over $50,000 or under $50,000. The features in this dataset include the workers’ ages, how they are employed (self-employed, private industry employee, government employee, etc.), their education, their gender, their working hours per week, occupation, and more.

The table shows the first few entries in the dataset.

The task is phrased as a classification task with the two classes being income *<=50k* and *>50k*. It would also be possible to predict the exact income and make this a regression task. However, that would be much more difficult, and the 50k division is interesting to understand on its own.

In this dataset, *age* and *hours-per-week* are continuous features, which we know how to treat. The *workclass*, *education*, *sex*, and *occupation* features are categorical, however. All of them come from a fixed list of possible values, as opposed to a range, and denote a qualitative property, as opposed to a quantity.
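As a rough illustration of this split, the sketch below sorts columns into continuous and categorical by checking whether every value is a number. The rows are hypothetical toy values that only borrow the column names from the text; they are not taken from the adult dataset, and the helper `is_continuous` is a name made up for this example.

```python
# Hypothetical toy rows: column names follow the text, values are invented.
rows = [
    {"age": 25, "workclass": "Private", "education": "Masters", "hours-per-week": 40},
    {"age": 47, "workclass": "Self-employed", "education": "Bachelors", "hours-per-week": 55},
    {"age": 33, "workclass": "Government", "education": "Bachelors", "hours-per-week": 38},
]


def is_continuous(values):
    """Heuristic (an assumption of this sketch): a column is continuous if all values are numbers."""
    return all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in values)


columns = list(rows[0])
continuous_cols = [c for c in columns if is_continuous([r[c] for r in rows])]
categorical_cols = [c for c in columns if c not in continuous_cols]

print("continuous: ", continuous_cols)
print("categorical:", categorical_cols)

# A categorical column draws from a fixed list of possible values.
for c in categorical_cols:
    print(c, "->", sorted({r[c] for r in rows}))
```

The numeric-type check is only a heuristic: a categorical feature that happens to be stored as integer codes would be misclassified as continuous, so the column's meaning still has to be checked by hand.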

As a starting point, let’s say we want to learn a logistic regression classifier on this data. We know from Chapter 2 that a logistic regression makes predictions, ŷ, using the following formula:

    ŷ = w[0] * x[0] + w[1] * x[1] + ... + w[p] * x[p] + b > 0

where w[i] and b are coefficients learned from the training set and x[i] are the input features. This formula makes sense when x[i] are numbers, but not when x[2] is "Masters" or "Bachelors". Clearly we need to represent our data in some different way when applying logistic regression. The next section will explain how we can overcome this problem.
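To make the failure concrete, here is a minimal sketch of the decision rule above in plain Python. The coefficients, the intercept, and the function name `linear_decision` are made up for illustration; they are not learned from any data.

```python
def linear_decision(w, x, b):
    """Return True when w[0]*x[0] + ... + w[p]*x[p] + b > 0."""
    return sum(w_i * x_i for w_i, x_i in zip(w, x)) + b > 0


# Made-up coefficients and intercept, for illustration only.
w = [0.02, 0.01, 0.5]
b = -1.0

# With purely numeric features the formula can be evaluated.
numeric_x = [25, 40, 1.0]
print(linear_decision(w, numeric_x, b))

# With a string in position 2, the multiplication w[2] * x[2] is undefined.
mixed_x = [25, 40, "Masters"]
try:
    linear_decision(w, mixed_x, b)
except TypeError as err:
    print("cannot evaluate the formula:", err)
```

The second call fails because multiplying a floating-point coefficient by the string "Masters" has no meaning, which is exactly why the categorical features need a numeric representation first.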