A particularly common type of feature is the **categorical feature**. Also known as discrete features, these are usually **not numeric**. The distinction between categorical and continuous features is analogous to the distinction between classification and regression, only on the input side rather than the output side.

Regardless of the types of features your data consists of, how you represent them can have an enormous effect on the performance of machine learning models.

The question of how best to represent your data for a particular application is known as **feature engineering**. Representing your data in
the right way can have a bigger influence on the performance of a supervised model than the exact parameters you choose.


## Categorical Variables

The task for the adult dataset is to predict whether a
worker has an income of over $50,000 or under $50,000. The features in this dataset
include the workers' ages, how they are employed (self-employed, private industry
employee, government employee, etc.), their education, their gender, their working
hours per week, their occupation, and more.

In this dataset, age and hours-per-week are continuous features, which we know
how to treat. The workclass, education, gender, and occupation features are
categorical, however.

By far the most common way to represent categorical variables is **one-hot encoding**, or **one-out-of-N encoding**, also known as **dummy variables**.

The idea behind dummy variables is to **replace a categorical variable with one or more new features that can have the values 0 and 1**. The values 0 and 1 make sense in the formula for linear binary classification (and for all other models in scikit-learn), and we can represent any number of categories by introducing one new feature per category, as described here.

Let's say for the workclass feature we have possible values of "Government Employee", "Private Employee", "Self Employed", and "Self Employed Incorporated". To encode these four possible values, we create four new features, called "Government Employee", "Private Employee", "Self Employed", and "Self Employed Incorporated". A feature is 1 if workclass for this person has the corresponding value and 0 otherwise, so exactly one of the four new features will be 1 for each data point. This is why the scheme is called **one-hot** or **one-out-of-N encoding**.
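The four-category example above can be sketched in a few lines. This is a minimal illustration, not the dataset itself: the `workclass` Series below is hypothetical toy data made up for the example, and `pd.get_dummies` is used as one convenient way to build the 0/1 columns.

```python
import pandas as pd

# Hypothetical toy data: one workclass value per person (not from adult.data)
workclass = pd.Series(
    ["Government Employee", "Private Employee",
     "Self Employed", "Self Employed Incorporated",
     "Private Employee"],
    name="workclass",
)

# Create one new 0/1 column per distinct category
one_hot = pd.get_dummies(workclass, dtype=int)
print(one_hot)

# Each row has exactly one column set to 1, hence "one-hot"
row_sums = one_hot.sum(axis=1)
print(row_sums.tolist())
```

Each of the five people ends up with four new features, and the row sums confirm that only one of them is active per data point.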

```python
import pandas as pd
from IPython.display import display

# The file has no headers naming the columns, so we pass header=None
# and provide the column names explicitly in "names"
data = pd.read_csv(
    "data/adult.data", header=None, index_col=False,
    names=['age', 'workclass', 'fnlwgt', 'education', 'education-num',
           'marital-status', 'occupation', 'relationship', 'race', 'gender',
           'capital-gain', 'capital-loss', 'hours-per-week', 'native-country',
           'income'])
# For illustration purposes, we only select some of the columns
data = data[['age', 'workclass', 'education', 'gender', 'hours-per-week',
             'occupation', 'income']]
# IPython.display allows nice output formatting within the Jupyter notebook
display(data.head())
```


| | age | workclass | education | gender | hours-per-week | occupation | income |
|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | Bachelors | Male | 40 | Adm-clerical | <=50K |
| 1 | 50 | Self-emp-not-inc | Bachelors | Male | 13 | Exec-managerial | <=50K |
| 2 | 38 | Private | HS-grad | Male | 40 | Handlers-cleaners | <=50K |
| 3 | 53 | Private | 11th | Male | 40 | Handlers-cleaners | <=50K |
| 4 | 28 | Private | Bachelors | Female | 40 | Prof-specialty | <=50K |
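To make the continuous/categorical split concrete, the sketch below re-types the five rows shown above into a small DataFrame (the name `sample` is made up for this example) and separates the columns by dtype. Treating every non-numeric column as categorical is only a rough heuristic and an assumption of this sketch; note that it also picks up `income`, which is the prediction target rather than an input feature.

```python
import pandas as pd

# The five rows shown above, re-typed by hand for illustration
sample = pd.DataFrame({
    "age": [39, 50, 38, 53, 28],
    "workclass": ["State-gov", "Self-emp-not-inc", "Private",
                  "Private", "Private"],
    "education": ["Bachelors", "Bachelors", "HS-grad", "11th", "Bachelors"],
    "gender": ["Male", "Male", "Male", "Male", "Female"],
    "hours-per-week": [40, 13, 40, 40, 40],
    "occupation": ["Adm-clerical", "Exec-managerial", "Handlers-cleaners",
                   "Handlers-cleaners", "Prof-specialty"],
    "income": ["<=50K"] * 5,
})

# Numeric columns are treated as continuous, everything else as categorical
continuous = sample.select_dtypes(include="number").columns.tolist()
categorical = sample.select_dtypes(exclude="number").columns.tolist()

print("continuous:", continuous)
print("categorical:", categorical)
```

A dtype check like this would miss categorical variables that happen to be stored as integers, so it is worth confirming the result against what the columns actually mean.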
