## What is Categorical Data?
Categorical data are variables that contain label values rather than numeric values.

The number of possible values is often limited to a fixed set.

Categorical variables are often called nominal, although strictly that term applies to categories with no inherent order.

Some examples include:

* A “pet” variable with the values: “dog” and “cat”.
* A “color” variable with the values: “red”, “green” and “blue”.
* A “place” variable with the values: “first”, “second” and “third”.

Each value represents a different category.

Some categories may have a natural relationship to each other, such as a natural ordering.

The “place” variable above does have a natural ordering of values. This type of categorical variable is called an ordinal variable.


## What is the Problem with Categorical Data?
Some algorithms can work with categorical data directly.

For example, a decision tree can be learned directly from categorical data with no data transform required (this depends on the specific implementation).

Many machine learning algorithms, however, cannot operate on label data directly. They require all input variables and output variables to be numeric.

In general, this is mostly a constraint of efficient implementations of machine learning algorithms rather than a hard limitation of the algorithms themselves.

This means that categorical data must be converted to a numerical form. If the categorical variable is an output variable, you may also want to convert predictions by the model back into a categorical form in order to present them or use them in some application.

## How to Convert Categorical Data to Numerical Data?
This involves two steps:

1. Integer Encoding
2. One-Hot Encoding

### 1. Integer Encoding
As a first step, each unique category value is assigned an integer value.

For example, “red” is 1, “green” is 2, and “blue” is 3.

This is called a label encoding or an integer encoding and is easily reversible.
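
As a rough illustration of the mapping and its reversal, the sketch below uses only Base Julia. The names `color_levels`, `to_code`, `encode` and `decode` are hypothetical helpers introduced here, and the level order is chosen by hand to match the example above.

```julia
# Assumed level order, chosen to match "red" = 1, "green" = 2, "blue" = 3.
color_levels = ["red", "green", "blue"]

# Map each category to its 1-based position in `color_levels`.
to_code = Dict(level => i for (i, level) in enumerate(color_levels))

# Hypothetical helpers: category -> integer, and integer -> category.
encode(xs) = [to_code[x] for x in xs]
decode(codes) = [color_levels[c] for c in codes]

colors = ["green", "red", "blue", "green"]
codes = encode(colors)       # [2, 1, 3, 2]
restored = decode(codes)     # the original strings again
```

Because the mapping is one-to-one, decoding the codes recovers the original labels exactly. An unseen category would raise a `KeyError` in this sketch.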

For some variables, this may be enough.

The integer values have a natural ordered relationship with each other, and machine learning algorithms may be able to understand and harness this relationship.

For example, for an ordinal variable like the “place” variable above, a label encoding would be sufficient.
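
For an ordinal variable, the order of the levels has to be stated explicitly so that the integer codes preserve it. The following is a minimal Base Julia sketch under that assumption; `place_levels` and `place_code` are hypothetical names, not part of any library.

```julia
# Assumed ordering of the levels, listed from best to worst.
place_levels = ["first", "second", "third"]

# Hypothetical helper: the code of a level is its position in the ordering.
# `findfirst` returns `nothing` for a value that is not a known level.
place_code(p) = findfirst(==(p), place_levels)

results = ["third", "first", "second"]
place_codes = place_code.(results)                 # [3, 1, 2]

# Comparisons on the codes now agree with the natural ordering.
first_beats_third = place_code("first") < place_code("third")   # true

# Sorting the codes and mapping back gives the levels in rank order.
ranked = place_levels[sort(place_codes)]
```

Here comparing or sorting the codes is meaningful, which is what makes an integer encoding adequate for ordinal data.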

### 2. One-Hot Encoding
For categorical variables where no such ordinal relationship exists, the integer encoding is not enough.

In fact, using this encoding and allowing the model to assume a natural ordering between categories may result in poor performance or unexpected results (predictions halfway between categories).
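
A small sketch of how this can go wrong, assuming the integer codes from the “color” example and a model that treats them as magnitudes. The `color_code` dictionary is introduced here purely for illustration.

```julia
# Integer codes from the "color" example.
color_code = Dict("red" => 1, "green" => 2, "blue" => 3)

# Treating the codes as numbers, the midpoint of "red" and "blue" ...
midpoint = (color_code["red"] + color_code["blue"]) / 2

# ... coincides with the code for "green", although green is not
# in any meaningful sense "between" red and blue.
lands_on_green = midpoint == color_code["green"]   # true
```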

In this case, a one-hot encoding can be applied to the integer representation: the integer-encoded variable is removed and a new binary variable is added for each unique integer value.

In the “color” variable example, there are 3 categories and therefore 3 binary variables are needed. A “1” value is placed in the binary variable for the color and “0” values for the other colors.
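
The “color” case can be sketched in Base Julia as follows. This is one possible way to build the binary variables, assuming the integer codes 1 to 3 from the integer-encoding step. The variable names are hypothetical.

```julia
color_levels = ["red", "green", "blue"]
codes = [1, 2, 3, 2]     # integer-encoded colors: red, green, blue, green

# One row per observation, one column per category:
# entry (i, j) is 1 when observation i belongs to category j, else 0.
onehot = [Int(c == j) for c in codes, j in 1:length(color_levels)]

# Each row contains exactly one 1, so the position of the largest
# entry in a row recovers the integer code.
decoded = [argmax(row) for row in eachrow(onehot)]
```

Each row of `onehot` has a single 1 in the column of its color, so no ordering between colors is implied.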

In [10]:
using DataFrames
using MLLabelUtils

The next cell builds one indicator column per distinct label using DataFrames. Because each row of col2 holds a list of labels, a row can have more than one indicator set.


In [11]:
df = DataFrame(col1=102:104, col2=[["a"], ["a","b"], ["c","b"]])
ux = unique(reduce(vcat, df.col2))
transform(df, :col2 .=> [ByRow(v -> x in v) for x in ux] .=> Symbol.(:col2_, ux))

| Row | col1 | col2 | col2_a | col2_b | col2_c |
|---|---|---|---|---|---|
| | Int64 | Array… | Bool | Bool | Bool |
| 1 | 102 | ["a"] | 1 | 0 | 0 |
| 2 | 103 | ["a", "b"] | 1 | 1 | 0 |
| 3 | 104 | ["c", "b"] | 0 | 1 | 1 |

The remaining cells use convertlabel from MLLabelUtils to map a two-class label vector onto a different pair of labels: first 0/1 targets onto :yes/:no, then the strings in col2 onto 0/1 and onto :yes/:no.


In [12]:
true_targets = Int8[0, 1, 0, 1, 1];
convertlabel([:yes,:no], true_targets)

5-element Vector{Symbol}:
 :no
 :yes
 :no
 :yes
 :yes

In [13]:
df = DataFrame(col1=102:104, col2=["a", "b", "b"])
convertlabel([0,1], df.col2)


3-element Vector{Int64}:
 0
 1
 1

In [14]:
df = DataFrame(col1=102:104, col2=["a", "b", "b"])
convertlabel([:yes,:no], df.col2)


3-element Vector{Symbol}:
 :yes
 :no
 :no