When data is not numerical, we need to convert it to a numeric representation before we can process it with the usual methods.

In [32]:
import pandas as pd

In [63]:
target_url = 'https://ocw.mit.edu/courses/sloan-school-of-management/15-097-prediction-machine-learning-and-statistics-spring-2012/datasets/careval.csv'
feature_names = ['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety', 'target']
df = pd.read_csv(target_url, names=feature_names)
df[['doors', 'persons']] = df[['doors', 'persons']].apply(pd.to_numeric, errors='coerce')

In [65]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1728 entries, 0 to 1727
Data columns (total 7 columns):
buying      1728 non-null object
maint       1728 non-null object
doors       1296 non-null float64
persons     1152 non-null float64
lug_boot    1728 non-null object
safety      1728 non-null object
target      1728 non-null object
dtypes: float64(2), object(5)
memory usage: 94.6+ KB


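As you can see, several columns are not numeric: five of the seven have dtype object. The doors and persons columns are numeric, but they contain missing values, because errors='coerce' turns every entry that cannot be parsed as a number into NaN.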
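
The effect of errors='coerce' can be illustrated on a small, made-up Series. This is only a sketch: the values below are hypothetical and chosen for illustration, not read from the dataset itself.

```python
import pandas as pd

# Hypothetical values for illustration; the last one is not a valid number.
raw = pd.Series(['2', '3', '4', '5more'])

# errors='coerce' replaces anything that cannot be parsed with NaN
# instead of raising an exception.
converted = pd.to_numeric(raw, errors='coerce')

print(converted)
print(converted.isna().sum())  # number of entries that became NaN
```

Because NaN is a floating-point value, the converted column ends up as float64, which matches the dtype reported for doors and persons above.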

In [68]:
df_dummy = pd.get_dummies(df, drop_first=True)
df_dummy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1728 entries, 0 to 1727
Data columns (total 15 columns):
doors             1296 non-null float64
persons           1152 non-null float64
buying_low        1728 non-null uint8
buying_med        1728 non-null uint8
buying_vhigh      1728 non-null uint8
maint_low         1728 non-null uint8
maint_med         1728 non-null uint8
maint_vhigh       1728 non-null uint8
lug_boot_med      1728 non-null uint8
lug_boot_small    1728 non-null uint8
safety_low        1728 non-null uint8
safety_med        1728 non-null uint8
target_good       1728 non-null uint8
target_unacc      1728 non-null uint8
target_vgood      1728 non-null uint8
dtypes: float64(2), uint8(13)
memory usage: 49.0 KB


The get_dummies function splits every column of type object into several indicator columns, a technique also known as one-hot encoding. If a feature has, for example, 3 different values, 3 columns are created, and each column is 1 for the rows that carry the corresponding label and 0 otherwise. For a given feature, if all of the other new columns are zero, the remaining column must be 1, so we can drop one column and infer its value from the rest. That is what the "drop_first" parameter does: it removes the first indicator column of each encoded feature, which is why there is, for instance, no buying_high column above. Note that the target column has been encoded as well, since it is also of type object.
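
A minimal sketch of the difference, using a made-up one-column DataFrame (the name toy and its values are assumptions for illustration, not part of the car evaluation data):

```python
import pandas as pd

# Toy data: a single categorical feature with three distinct labels.
toy = pd.DataFrame({'safety': ['low', 'med', 'high', 'med']})

# Full encoding: one indicator column per label.
full = pd.get_dummies(toy)

# drop_first=True drops the first label in sorted order ('high' here).
reduced = pd.get_dummies(toy, drop_first=True)

print(list(full.columns))     # ['safety_high', 'safety_low', 'safety_med']
print(list(reduced.columns))  # ['safety_low', 'safety_med']
```

In the reduced frame, the row whose label is 'high' has zeros in both remaining columns, which is how the dropped label is inferred.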

In [70]:
df['buying'] = df['buying'].astype('category')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1728 entries, 0 to 1727
Data columns (total 7 columns):
buying      1728 non-null category
maint       1728 non-null object
doors       1296 non-null float64
persons     1152 non-null float64
lug_boot    1728 non-null object
safety      1728 non-null object
target      1728 non-null object
dtypes: category(1), float64(2), object(4)
memory usage: 83.0+ KB
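
Converting a column to the category dtype keeps the labels but stores each value internally as an integer code, which is one more way to obtain a numeric representation. The sketch below uses a hypothetical Series rather than the real buying column, and relies on the .cat accessor that pandas provides for categorical data.

```python
import pandas as pd

# Hypothetical labels for illustration.
s = pd.Series(['vhigh', 'low', 'med', 'high', 'low']).astype('category')

# The distinct labels, sorted alphabetically by default.
print(list(s.cat.categories))  # ['high', 'low', 'med', 'vhigh']

# The integer code assigned to each row.
print(list(s.cat.codes))       # [3, 1, 2, 0, 1]
```

The codes follow the alphabetical order of the labels, not their meaning, so 'high' gets a smaller code than 'low'. If the order matters, the categories would have to be specified explicitly.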
