# Label Encoder
Copyright (c) Microsoft Corporation. All rights reserved.<br>
Licensed under the MIT License.<br>

Data Prep can encode the distinct values (classes) of a column as integer labels from 0 to (number of classes - 1) using `label_encode`.

In [1]:
import azureml.dataprep as dprep

# Load the sample crime data and preview the first five rows.
dflow = dprep.read_csv(path='../data/crime-spring.csv')
dflow.head(5)

Unnamed: 0,ID,Case Number,Date,Block,IUCR,Primary Type,Description,Location Description,Arrest,Domestic,...,Ward,Community Area,FBI Code,X Coordinate,Y Coordinate,Year,Updated On,Latitude,Longitude,Location
0,10498554,HZ239907,4/15/2016 23:56,007XX E 111TH ST,1153,DECEPTIVE PRACTICE,FINANCIAL IDENTITY THEFT OVER $ 300,OTHER,False,False,...,9,50,11,1183356.0,1831503.0,2016,5/11/2016 15:48,41.69283384,-87.60431945,"(41.692833841, -87.60431945)"
1,10516598,HZ258664,4/15/2016 17:00,082XX S MARSHFIELD AVE,890,THEFT,FROM BUILDING,RESIDENCE,False,False,...,21,71,6,1166776.0,1850053.0,2016,5/12/2016 15:48,41.74410697,-87.66449429,"(41.744106973, -87.664494285)"
2,10519196,HZ261252,4/15/2016 10:00,104XX S SACRAMENTO AVE,1154,DECEPTIVE PRACTICE,FINANCIAL IDENTITY THEFT $300 AND UNDER,RESIDENCE,False,False,...,19,74,11,,,2016,5/12/2016 15:50,,,
3,10519591,HZ261534,4/15/2016 9:00,113XX S PRAIRIE AVE,1120,DECEPTIVE PRACTICE,FORGERY,RESIDENCE,False,False,...,9,49,10,,,2016,5/13/2016 15:51,,,
4,10534446,HZ277630,4/15/2016 10:00,055XX N KEDZIE AVE,890,THEFT,FROM BUILDING,"SCHOOL, PUBLIC, BUILDING",False,False,...,40,13,6,,,2016,5/25/2016 15:59,,,


To use `label_encode` on a Dataflow, specify the source column and the new column name. `label_encode` determines the distinct values (classes) in the source column and returns a new Dataflow with an added column containing the labels.

In [2]:
dflow = dflow.label_encode(source_column='Primary Type', new_column_name='Primary Type Label')
dflow.head(5)

Unnamed: 0,ID,Case Number,Date,Block,IUCR,Primary Type,Primary Type Label,Description,Location Description,Arrest,...,Ward,Community Area,FBI Code,X Coordinate,Y Coordinate,Year,Updated On,Latitude,Longitude,Location
0,10498554,HZ239907,4/15/2016 23:56,007XX E 111TH ST,1153,DECEPTIVE PRACTICE,0,FINANCIAL IDENTITY THEFT OVER $ 300,OTHER,False,...,9,50,11,1183356.0,1831503.0,2016,5/11/2016 15:48,41.69283384,-87.60431945,"(41.692833841, -87.60431945)"
1,10516598,HZ258664,4/15/2016 17:00,082XX S MARSHFIELD AVE,890,THEFT,1,FROM BUILDING,RESIDENCE,False,...,21,71,6,1166776.0,1850053.0,2016,5/12/2016 15:48,41.74410697,-87.66449429,"(41.744106973, -87.664494285)"
2,10519196,HZ261252,4/15/2016 10:00,104XX S SACRAMENTO AVE,1154,DECEPTIVE PRACTICE,0,FINANCIAL IDENTITY THEFT $300 AND UNDER,RESIDENCE,False,...,19,74,11,,,2016,5/12/2016 15:50,,,
3,10519591,HZ261534,4/15/2016 9:00,113XX S PRAIRIE AVE,1120,DECEPTIVE PRACTICE,0,FORGERY,RESIDENCE,False,...,9,49,10,,,2016,5/13/2016 15:51,,,
4,10534446,HZ277630,4/15/2016 10:00,055XX N KEDZIE AVE,890,THEFT,1,FROM BUILDING,"SCHOOL, PUBLIC, BUILDING",False,...,40,13,6,,,2016,5/25/2016 15:59,,,
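
As a rough illustration of what label encoding does, the following plain-Python sketch learns a mapping from distinct values to integers and applies it to the five `Primary Type` values shown above. It is not Data Prep's implementation: the helper name `learn_labels` is hypothetical, and numbering classes in order of first appearance is an assumption made for this sketch (the builder output for `Location Description` in this notebook suggests Data Prep does not necessarily number classes that way).

```python
def learn_labels(values):
    """Hypothetical helper: map each distinct value to an integer in 0..n-1.

    Assumption for this sketch: classes are numbered in order of first appearance.
    """
    labels = {}
    for value in values:
        if value not in labels:
            labels[value] = len(labels)
    return labels


# 'Primary Type' values of the five rows shown above.
primary_types = [
    'DECEPTIVE PRACTICE',
    'THEFT',
    'DECEPTIVE PRACTICE',
    'DECEPTIVE PRACTICE',
    'THEFT',
]

labels = learn_labels(primary_types)
encoded = [labels[value] for value in primary_types]

print(labels)   # {'DECEPTIVE PRACTICE': 0, 'THEFT': 1}
print(encoded)  # [0, 1, 0, 0, 1]
```

With two classes, the labels are 0 and 1, which is the "0 to (number of classes - 1)" range described at the top of this notebook.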


For more control over the encoded labels, use a builder. The builder lets you preview and modify the encoded labels before generating a new Dataflow with the results.
To get started, create a builder object with `dataflow.builders.label_encode`, specifying the source column and the new column name.

In [3]:
builder = dflow.builders.label_encode(source_column='Location Description', new_column_name='Location Description Label')

To generate the encoded labels, call the `learn` method on the builder object:

In [4]:
builder.learn()

To check the result, read the generated labels from the builder's `encoded_labels` property:

In [5]:
builder.encoded_labels

{'OTHER': 3, 'RESIDENCE': 0, 'RESTAURANT': 1, 'SCHOOL, PUBLIC, BUILDING': 2}

To modify the generated labels, assign a new value to `encoded_labels`. The following example adds a label for a value that does not appear in the sample data: `builder.encoded_labels` is saved into a variable `encoded_labels`, modified, and assigned back to `builder.encoded_labels`.

In [6]:
encoded_labels = builder.encoded_labels
encoded_labels['TOWNHOUSE'] = 6

builder.encoded_labels = encoded_labels
builder.encoded_labels

{'OTHER': 3,
 'RESIDENCE': 0,
 'RESTAURANT': 1,
 'SCHOOL, PUBLIC, BUILDING': 2,
 'TOWNHOUSE': 6}
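
When editing the mapping by hand, it is easy to reuse an integer that already belongs to another class. The sketch below works on a plain dictionary with the values shown above and refuses such collisions. The helper `add_label` is hypothetical and not part of Data Prep, and whether the builder itself performs a check like this is not stated in this notebook.

```python
# Mapping copied from the builder output shown above.
learned_labels = {
    'OTHER': 3,
    'RESIDENCE': 0,
    'RESTAURANT': 1,
    'SCHOOL, PUBLIC, BUILDING': 2,
}


def add_label(mapping, value, code):
    """Hypothetical helper: return a copy of `mapping` with `value -> code` added.

    Raises ValueError if the value is already mapped or the code is already taken.
    """
    if value in mapping:
        raise ValueError(f"{value!r} already has label {mapping[value]}")
    if code in mapping.values():
        raise ValueError(f"label {code} is already assigned to another class")
    updated = dict(mapping)  # leave the original mapping untouched
    updated[value] = code
    return updated


updated_labels = add_label(learned_labels, 'TOWNHOUSE', 6)
print(updated_labels['TOWNHOUSE'])  # 6

try:
    add_label(updated_labels, 'APARTMENT', 3)  # 3 already belongs to 'OTHER'
except ValueError as error:
    print(error)
```

The resulting dictionary could then be assigned back to `builder.encoded_labels` as in the cell above.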

Once the desired results are achieved, call `builder.to_dataflow` to get the new Dataflow with the encoded labels.

In [7]:
dataflow = builder.to_dataflow()
dataflow.head(5)

Unnamed: 0,ID,Case Number,Date,Block,IUCR,Primary Type,Primary Type Label,Description,Location Description,Location Description Label,...,Ward,Community Area,FBI Code,X Coordinate,Y Coordinate,Year,Updated On,Latitude,Longitude,Location
0,10498554,HZ239907,4/15/2016 23:56,007XX E 111TH ST,1153,DECEPTIVE PRACTICE,0,FINANCIAL IDENTITY THEFT OVER $ 300,OTHER,3,...,9,50,11,1183356.0,1831503.0,2016,5/11/2016 15:48,41.69283384,-87.60431945,"(41.692833841, -87.60431945)"
1,10516598,HZ258664,4/15/2016 17:00,082XX S MARSHFIELD AVE,890,THEFT,1,FROM BUILDING,RESIDENCE,0,...,21,71,6,1166776.0,1850053.0,2016,5/12/2016 15:48,41.74410697,-87.66449429,"(41.744106973, -87.664494285)"
2,10519196,HZ261252,4/15/2016 10:00,104XX S SACRAMENTO AVE,1154,DECEPTIVE PRACTICE,0,FINANCIAL IDENTITY THEFT $300 AND UNDER,RESIDENCE,0,...,19,74,11,,,2016,5/12/2016 15:50,,,
3,10519591,HZ261534,4/15/2016 9:00,113XX S PRAIRIE AVE,1120,DECEPTIVE PRACTICE,0,FORGERY,RESIDENCE,0,...,9,49,10,,,2016,5/13/2016 15:51,,,
4,10534446,HZ277630,4/15/2016 10:00,055XX N KEDZIE AVE,890,THEFT,1,FROM BUILDING,"SCHOOL, PUBLIC, BUILDING",2,...,40,13,6,,,2016,5/25/2016 15:59,,,
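
Conceptually, producing the new column is a per-row lookup in the final mapping. The following plain-Python sketch reproduces the `Location Description Label` values of the five rows above. The helper `apply_labels` is hypothetical, and returning `None` for values missing from the mapping is an assumption of this sketch; how Data Prep treats such values is not covered in this notebook.

```python
# Final mapping from the builder, including the manually added 'TOWNHOUSE' entry.
encoded_labels = {
    'OTHER': 3,
    'RESIDENCE': 0,
    'RESTAURANT': 1,
    'SCHOOL, PUBLIC, BUILDING': 2,
    'TOWNHOUSE': 6,
}

# 'Location Description' values of the five rows shown above.
locations = [
    'OTHER',
    'RESIDENCE',
    'RESIDENCE',
    'RESIDENCE',
    'SCHOOL, PUBLIC, BUILDING',
]


def apply_labels(values, mapping, missing=None):
    """Hypothetical helper: look up each value; unmapped values become `missing`."""
    return [mapping.get(value, missing) for value in values]


location_labels = apply_labels(locations, encoded_labels)
print(location_labels)  # [3, 0, 0, 0, 2]
```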
