# One Hot Encoder
Copyright (c) Microsoft Corporation. All rights reserved.  
Licensed under the MIT License.

Azure ML Data Prep can one-hot encode a selected column using `one_hot_encode`. The resulting Dataflow has a new binary column for each categorical label encountered in the selected column.

In [1]:
import azureml.dataprep as dprep
dflow = dprep.read_csv(path='../data/crime-spring.csv')
dflow.head(5)

Unnamed: 0,ID,Case Number,Date,Block,IUCR,Primary Type,Description,Location Description,Arrest,Domestic,...,Ward,Community Area,FBI Code,X Coordinate,Y Coordinate,Year,Updated On,Latitude,Longitude,Location
0,10498554,HZ239907,4/15/2016 23:56,007XX E 111TH ST,1153,DECEPTIVE PRACTICE,FINANCIAL IDENTITY THEFT OVER $ 300,OTHER,False,False,...,9,50,11,1183356.0,1831503.0,2016,5/11/2016 15:48,41.69283384,-87.60431945,"(41.692833841, -87.60431945)"
1,10516598,HZ258664,4/15/2016 17:00,082XX S MARSHFIELD AVE,890,THEFT,FROM BUILDING,RESIDENCE,False,False,...,21,71,6,1166776.0,1850053.0,2016,5/12/2016 15:48,41.74410697,-87.66449429,"(41.744106973, -87.664494285)"
2,10519196,HZ261252,4/15/2016 10:00,104XX S SACRAMENTO AVE,1154,DECEPTIVE PRACTICE,FINANCIAL IDENTITY THEFT $300 AND UNDER,RESIDENCE,False,False,...,19,74,11,,,2016,5/12/2016 15:50,,,
3,10519591,HZ261534,4/15/2016 9:00,113XX S PRAIRIE AVE,1120,DECEPTIVE PRACTICE,FORGERY,RESIDENCE,False,False,...,9,49,10,,,2016,5/13/2016 15:51,,,
4,10534446,HZ277630,4/15/2016 10:00,055XX N KEDZIE AVE,890,THEFT,FROM BUILDING,"SCHOOL, PUBLIC, BUILDING",False,False,...,40,13,6,,,2016,5/25/2016 15:59,,,


To use `one_hot_encode` on a Dataflow, specify the source column. `one_hot_encode` determines the distinct values (the categorical labels) in the source column from the current data and returns a new Dataflow with a new binary column for each label. Note that the categorical labels are remembered in the Dataflow step.

In [2]:
dflow_result = dflow.one_hot_encode(source_column='Location Description')
dflow_result.head(5)

Unnamed: 0,ID,Case Number,Date,Block,IUCR,Primary Type,Description,Location Description,Location Description_RESIDENCE,Location Description_RESTAURANT,...,Ward,Community Area,FBI Code,X Coordinate,Y Coordinate,Year,Updated On,Latitude,Longitude,Location
0,10498554,HZ239907,4/15/2016 23:56,007XX E 111TH ST,1153,DECEPTIVE PRACTICE,FINANCIAL IDENTITY THEFT OVER $ 300,OTHER,0,0,...,9,50,11,1183356.0,1831503.0,2016,5/11/2016 15:48,41.69283384,-87.60431945,"(41.692833841, -87.60431945)"
1,10516598,HZ258664,4/15/2016 17:00,082XX S MARSHFIELD AVE,890,THEFT,FROM BUILDING,RESIDENCE,1,0,...,21,71,6,1166776.0,1850053.0,2016,5/12/2016 15:48,41.74410697,-87.66449429,"(41.744106973, -87.664494285)"
2,10519196,HZ261252,4/15/2016 10:00,104XX S SACRAMENTO AVE,1154,DECEPTIVE PRACTICE,FINANCIAL IDENTITY THEFT $300 AND UNDER,RESIDENCE,1,0,...,19,74,11,,,2016,5/12/2016 15:50,,,
3,10519591,HZ261534,4/15/2016 9:00,113XX S PRAIRIE AVE,1120,DECEPTIVE PRACTICE,FORGERY,RESIDENCE,1,0,...,9,49,10,,,2016,5/13/2016 15:51,,,
4,10534446,HZ277630,4/15/2016 10:00,055XX N KEDZIE AVE,890,THEFT,FROM BUILDING,"SCHOOL, PUBLIC, BUILDING",0,0,...,40,13,6,,,2016,5/25/2016 15:59,,,
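
To make the idea concrete, here is a minimal plain-Python sketch of one hot encoding. It does not use Azure ML Data Prep; the helper names `learn_labels` and `one_hot` are hypothetical, and the first-seen ordering of labels is an assumption made for illustration only.

```python
def learn_labels(values):
    """Collect the distinct labels, in first-seen order (ordering is an assumption)."""
    labels = []
    for value in values:
        if value not in labels:
            labels.append(value)
    return labels


def one_hot(values, labels):
    """Return one dict per row, mapping every label to 1 (match) or 0 (no match)."""
    return [{label: int(value == label) for label in labels} for value in values]


# The 'Location Description' values of the five rows shown above.
locations = ["OTHER", "RESIDENCE", "RESIDENCE", "RESIDENCE", "SCHOOL, PUBLIC, BUILDING"]

labels = learn_labels(locations)       # learned once from the current data
encoded = one_hot(locations, labels)   # one binary column per learned label

# Reusing the remembered labels on new data keeps the same set of columns.
# In this sketch an unseen value simply gets 0 in every column.
encoded_new = one_hot(["RESTAURANT"], labels)

print(labels)
print(encoded[1])
```

Keeping the learned labels separate from the encoding step mirrors the note above that the labels are remembered: the set of output columns is fixed when the labels are learned, not when the data is read again.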


By default, all new columns use the `source_column` name followed by an underscore as a prefix. To specify your own prefix, pass a `prefix` string as the second parameter.

In [3]:
dflow_result = dflow.one_hot_encode(source_column='Location Description', prefix='LOCATION_')
dflow_result.head(5)

Unnamed: 0,ID,Case Number,Date,Block,IUCR,Primary Type,Description,Location Description,LOCATION_RESIDENCE,LOCATION_RESTAURANT,...,Ward,Community Area,FBI Code,X Coordinate,Y Coordinate,Year,Updated On,Latitude,Longitude,Location
0,10498554,HZ239907,4/15/2016 23:56,007XX E 111TH ST,1153,DECEPTIVE PRACTICE,FINANCIAL IDENTITY THEFT OVER $ 300,OTHER,0,0,...,9,50,11,1183356.0,1831503.0,2016,5/11/2016 15:48,41.69283384,-87.60431945,"(41.692833841, -87.60431945)"
1,10516598,HZ258664,4/15/2016 17:00,082XX S MARSHFIELD AVE,890,THEFT,FROM BUILDING,RESIDENCE,1,0,...,21,71,6,1166776.0,1850053.0,2016,5/12/2016 15:48,41.74410697,-87.66449429,"(41.744106973, -87.664494285)"
2,10519196,HZ261252,4/15/2016 10:00,104XX S SACRAMENTO AVE,1154,DECEPTIVE PRACTICE,FINANCIAL IDENTITY THEFT $300 AND UNDER,RESIDENCE,1,0,...,19,74,11,,,2016,5/12/2016 15:50,,,
3,10519591,HZ261534,4/15/2016 9:00,113XX S PRAIRIE AVE,1120,DECEPTIVE PRACTICE,FORGERY,RESIDENCE,1,0,...,9,49,10,,,2016,5/13/2016 15:51,,,
4,10534446,HZ277630,4/15/2016 10:00,055XX N KEDZIE AVE,890,THEFT,FROM BUILDING,"SCHOOL, PUBLIC, BUILDING",0,0,...,40,13,6,,,2016,5/25/2016 15:59,,,
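
The naming pattern visible in the two outputs above can be sketched in plain Python. The helper `encoded_column_names` is hypothetical, and the default prefix rule (source column name plus an underscore) is inferred from the column names shown, not taken from the library's implementation.

```python
def encoded_column_names(source_column, labels, prefix=None):
    """Build the names of the new binary columns for the given labels."""
    # Assumption inferred from the outputs above: without an explicit prefix,
    # the source column name followed by an underscore is used.
    if prefix is None:
        prefix = source_column + "_"
    # A custom prefix is used verbatim, so include any separator yourself.
    return [prefix + label for label in labels]


labels = ["RESIDENCE", "RESTAURANT"]
default_names = encoded_column_names("Location Description", labels)
custom_names = encoded_column_names("Location Description", labels, prefix="LOCATION_")

print(default_names)
print(custom_names)
```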


To have more control over the categorical labels, create a builder using `dataflow.builders.one_hot_encode`. The builder lets you preview and modify the categorical labels before generating a new Dataflow with the results.

In [4]:
builder = dflow_result.builders.one_hot_encode(source_column='Location Description', prefix='LOCATION_')

To generate the categorical labels, call the `learn` method on the builder object:

In [5]:
builder.learn()

To preview the categorical labels, simply access them through the property `categorical_labels` on the builder object:

In [6]:
builder.categorical_labels

['RESIDENCE', 'RESTAURANT', 'SCHOOL, PUBLIC, BUILDING', 'OTHER']

To modify the generated `categorical_labels`, assign a new value to `categorical_labels` or modify the existing list. The following example adds a label that does not occur in the sample data to `categorical_labels`.

In [7]:
builder.categorical_labels.append('TOWNHOUSE')
builder.categorical_labels

['RESIDENCE', 'RESTAURANT', 'SCHOOL, PUBLIC, BUILDING', 'OTHER', 'TOWNHOUSE']
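
One reason to add a label by hand is to keep the set of encoded columns stable when later data contains a value the sample did not. The plain-Python sketch below illustrates that effect only; the `encode` helper is hypothetical and does not reflect how Data Prep implements the step.

```python
def encode(value, labels):
    """Return a binary vector with one position per label."""
    return [int(value == label) for label in labels]


# Labels as learned from the sample, plus the manually added one.
labels = ["RESIDENCE", "RESTAURANT", "SCHOOL, PUBLIC, BUILDING", "OTHER"]
labels.append("TOWNHOUSE")

# No sample row contains 'TOWNHOUSE', so its column is 0 for all of them.
sample = ["RESIDENCE", "OTHER", "RESIDENCE"]
sample_rows = [encode(value, labels) for value in sample]

# A later row with 'TOWNHOUSE' fits the same five-column layout.
later_row = encode("TOWNHOUSE", labels)

print(sample_rows)
print(later_row)
```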

Once the desired results are achieved, call `builder.to_dataflow` to get the new Dataflow with the encoded labels.

In [8]:
dflow = builder.to_dataflow()
dflow

Dataflow
  name: crime-spring
  parent_package_path: None
  steps: [
    Step {
      type: Microsoft.DPrep.GetFilesBlock,
    },
    Step {
      type: Microsoft.DPrep.ParseDelimitedBlock,
    },
    Step {
      type: Microsoft.DPrep.DropColumnsBlock,
    },
    Step {
      type: Microsoft.DPrep.OneHotEncodingBlock,
    },
    Step {
      type: Microsoft.DPrep.OneHotEncodingBlock,
    },
  ]