# Turning categorical variables into quantitative variables in Python

Most statistical models cannot take objects or strings as input; for training they accept only numbers. That is why we need this transformation.

But take care: once we transform categories into numbers we often get a scale of values like [1, 2, 3, 4, 5] (imagine that the original values were ['white', 'black', 'blue', 'grey', 'yellow']). The model could interpret these numbers as an ordering or a degree of importance, which is not necessarily true. To avoid this misunderstanding we should apply a technique that is often called “one-hot encoding”.
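A minimal sketch of the problem described above, using pd.factorize as one possible way to produce integer codes (the text does not say which method is meant, so this choice is an assumption). Note that factorize starts counting at 0 rather than 1.

```python
import pandas as pd

colors = pd.Series(["white", "black", "blue", "grey", "yellow"])

# factorize assigns an integer to each distinct label, in order of appearance
codes, labels = pd.factorize(colors)

print(list(codes))   # one integer code per row
print(list(labels))  # the label each code stands for

# Arithmetic on the codes runs without error, but an "average color"
# has no meaning: the numbers suggest an order that does not exist.
mean_code = codes.mean()
print(mean_code)
```

A model fed these codes has no way to know that "yellow" is not four steps above "white".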

The pandas function get_dummies helps with that. It returns one column per label, containing 1 if that row has that label and 0 otherwise. In pandas 2.0 and later the columns hold True/False by default; passing dtype=int gives 0/1.
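To make the idea concrete, here is a rough sketch that builds the same kind of columns by hand and compares them with get_dummies. The sample values and the variable names (fuel, manual) are illustrative only, and the manual version is a simplification, not how pandas implements it internally.

```python
import pandas as pd

fuel = pd.Series(["gas", "diesel", "alcohol", "gas"], name="type")

# One 0/1 column per distinct label: compare every row against that label
manual = pd.DataFrame(
    {label: (fuel == label).astype(int) for label in sorted(fuel.unique())}
)

# The same encoding via pandas; dtype=int requests 0/1 instead of booleans
dummies = pd.get_dummies(fuel, dtype=int)

print(manual)
print(dummies)
```

Both frames have one column per label, in alphabetical order, with a single 1 in each row.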

## 1. Example 

Fuel data

In [23]:
import pandas as pd

dict_fuel = {"registration":[231,232,332,224,335,632,327,856,923], 
                 "type":['gas', 'gas', 'gas','diesel','diesel', 'alcohol', 'alcohol', 'diesel','gas']}

df_fuel = pd.DataFrame.from_dict(dict_fuel)

df_fuel

| | registration | type |
|---|---|---|
| 0 | 231 | gas |
| 1 | 232 | gas |
| 2 | 332 | gas |
| 3 | 224 | diesel |
| 4 | 335 | diesel |
| 5 | 632 | alcohol |
| 6 | 327 | alcohol |
| 7 | 856 | diesel |
| 8 | 923 | gas |


In [24]:
# Return dummy-coded data, which can be concatenated onto the original DataFrame.
# dtype=int keeps the 0/1 output on pandas >= 2.0, where the default is boolean.
dummies = pd.get_dummies(df_fuel['type'], dtype=int)

# Result
dummies

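| | alcohol | diesel | gas |
|---|---|---|---|
| 0 | 0 | 0 | 1 |
| 1 | 0 | 0 | 1 |
| 2 | 0 | 0 | 1 |
| 3 | 0 | 1 | 0 |
| 4 | 0 | 1 | 0 |
| 5 | 1 | 0 | 0 |
| 6 | 1 | 0 | 0 |
| 7 | 0 | 1 | 0 |
| 8 | 0 | 0 | 1 |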


In [25]:
df_fuel = pd.concat([df_fuel, dummies], axis=1)

# Result
df_fuel

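| | registration | type | alcohol | diesel | gas |
|---|---|---|---|---|---|
| 0 | 231 | gas | 0 | 0 | 1 |
| 1 | 232 | gas | 0 | 0 | 1 |
| 2 | 332 | gas | 0 | 0 | 1 |
| 3 | 224 | diesel | 0 | 1 | 0 |
| 4 | 335 | diesel | 0 | 1 | 0 |
| 5 | 632 | alcohol | 1 | 0 | 0 |
| 6 | 327 | alcohol | 1 | 0 | 0 |
| 7 | 856 | diesel | 0 | 1 | 0 |
| 8 | 923 | gas | 0 | 0 | 1 |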
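As a possible alternative to the separate get_dummies and concat steps, get_dummies can also take the whole DataFrame together with a columns argument. The sketch below uses a shortened copy of the fuel data; note that it behaves differently from the steps above, because the original type column is dropped and the new columns get a type_ prefix.

```python
import pandas as pd

df_fuel = pd.DataFrame({
    "registration": [231, 224, 632],
    "type": ["gas", "diesel", "alcohol"],
})

# Encode 'type' directly on the DataFrame instead of concatenating afterwards.
# The source column is replaced by prefixed dummy columns.
encoded = pd.get_dummies(df_fuel, columns=["type"], dtype=int)

print(list(encoded.columns))
print(encoded)
```

Whether this form is preferable depends on whether the original text column should remain in the result.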


## 2. Example

Colors data

In [26]:
import pandas as pd

dict_colors = {"registration":[231,232,332,224,335,632,327,856,923], 
                 "color":['white', 'yellow', 'yellow','purple','white', 'purple', 'white', 'black','black']}

df_colors = pd.DataFrame.from_dict(dict_colors)

df_colors

| | registration | color |
|---|---|---|
| 0 | 231 | white |
| 1 | 232 | yellow |
| 2 | 332 | yellow |
| 3 | 224 | purple |
| 4 | 335 | white |
| 5 | 632 | purple |
| 6 | 327 | white |
| 7 | 856 | black |
| 8 | 923 | black |


In [27]:
# Return dummy-coded data, which can be concatenated onto the original DataFrame.
# dtype=int keeps the 0/1 output on pandas >= 2.0, where the default is boolean.
dummies = pd.get_dummies(df_colors['color'], dtype=int)

# Result
dummies

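| | black | purple | white | yellow |
|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 0 |
| 1 | 0 | 0 | 0 | 1 |
| 2 | 0 | 0 | 0 | 1 |
| 3 | 0 | 1 | 0 | 0 |
| 4 | 0 | 0 | 1 | 0 |
| 5 | 0 | 1 | 0 | 0 |
| 6 | 0 | 0 | 1 | 0 |
| 7 | 1 | 0 | 0 | 0 |
| 8 | 1 | 0 | 0 | 0 |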


In [28]:
df_colors = pd.concat([df_colors, dummies], axis=1)

# Result
df_colors

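| | registration | color | black | purple | white | yellow |
|---|---|---|---|---|---|---|
| 0 | 231 | white | 0 | 0 | 1 | 0 |
| 1 | 232 | yellow | 0 | 0 | 0 | 1 |
| 2 | 332 | yellow | 0 | 0 | 0 | 1 |
| 3 | 224 | purple | 0 | 1 | 0 | 0 |
| 4 | 335 | white | 0 | 0 | 1 | 0 |
| 5 | 632 | purple | 0 | 1 | 0 | 0 |
| 6 | 327 | white | 0 | 0 | 1 | 0 |
| 7 | 856 | black | 1 | 0 | 0 | 0 |
| 8 | 923 | black | 1 | 0 | 0 | 0 |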


## Conclusion

As the outputs show, the encoded result can be compared directly with the original data frame: each dummy column holds 1 at the row indexes where its label appears and 0 everywhere else.
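This relationship can also be checked in code rather than by eye. The following is a small sketch on a shortened color list; the variable names (one_per_row, recovered, matches_original) are illustrative, and using idxmax to reverse the encoding is one option among several.

```python
import pandas as pd

colors = pd.Series(["white", "yellow", "purple", "black", "white"], name="color")
dummies = pd.get_dummies(colors, dtype=int)

# Every row has exactly one column set to 1
one_per_row = (dummies.sum(axis=1) == 1).all()

# The column holding the 1 is the original label, so the encoding is reversible
recovered = dummies.idxmax(axis=1)
matches_original = (recovered == colors).all()

print(one_per_row, matches_original)
```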