# Column Type Transforms
Copyright (c) Microsoft Corporation. All rights reserved.<br>
Licensed under the MIT License.

When consuming a data set, it is highly useful to know as much as possible about the data. Column types can help you understand more about each column, and enable type-specific transformations later. This provides much more insight than treating all data as strings.

In this notebook, you will learn about:
- [Built-in column types](#types)
- How to:
 - [Convert to long (integer)](#long)
 - [Convert to double (floating point or decimal number)](#double)
 - [Convert to boolean](#boolean)
 - [Convert to datetime](#datetime)
- [How to use `ColumnTypesBuilder` to get suggested column types and convert them](#builder)
- [How to convert column type for multiple columns if types are known](#multiple-columns)

## Set up

In [1]:
import azureml.dataprep as dprep

In [2]:
dflow = dprep.read_csv('../data/crime-winter.csv')
dflow = dflow.keep_columns(['Case Number', 'Date', 'IUCR', 'Arrest', 'Longitude', 'Latitude'])

## <a name="types"></a>Built-in column types

Currently, Data Prep supports the following column types: string, long (integer), double (floating point or decimal number), boolean, and datetime.

In the previous step, a data set was read in as a Dataflow, with only a few interesting columns kept. We will use this Dataflow to explore column types throughout the notebook.

In [3]:
dflow.head(5)

Unnamed: 0,Case Number,Date,IUCR,Arrest,Latitude,Longitude
0,HZ114126,1/10/2016 11:00,610,True,41.95388599,-87.71077048
1,HZ118288,1/10/2016 21:00,1754,False,41.79319349,-87.69622926
2,HZ110730,1/10/2016 11:50,5002,False,41.91705356,-87.73565764
3,HZ110403,1/10/2016 1:30,497,False,41.76039236,-87.68180481
4,HZ110836,1/10/2016 7:30,890,False,41.75068679,-87.61127681


From the first few rows of the Dataflow, you can see that the columns contain different types of data. However, by looking at `dtypes`, you can see that `read_csv()` treats all columns as string columns.

Note that `auto_read_file()` is a data ingestion function that infers column types. Learn more about it [here](./auto-read-file.ipynb).

In [4]:
dflow.dtypes

Latitude                 FieldType.STRING
Arrest                   FieldType.STRING
Date                     FieldType.STRING
Case Number              FieldType.STRING
Longitude                FieldType.STRING
IUCR                     FieldType.STRING

### <a name="long"></a>Converting to long (integer)

Suppose the "IUCR" column should only contain integers. You can call `to_long` to convert the column type of "IUCR" to `FieldType.INTEGER`. If you look at the data profile ([learn more about data profiles](./data-profile.ipynb)), you will see numeric metrics populated for that column such as mean, variance, quantiles, etc. This is helpful for understanding the shape and distribution of numeric data.

In [5]:
dflow_conversion = dflow.to_long('IUCR')
profile = dflow_conversion.get_profile()
profile

Unnamed: 0,Type,Min,Max,Count,Missing Count,Not Missing Count,Percent missing,Error Count,Empty count,0.1% Quantile,1% Quantile,5% Quantile,25% Quantile,50% Quantile,75% Quantile,95% Quantile,99% Quantile,99.9% Quantile,Mean,Standard Deviation,Variance,Skewness,Kurtosis
Case Number,FieldType.STRING,HZ110403,HZ138745,10.0,0.0,10.0,0.0,0.0,0.0,,,,,,,,,,,,,,
Date,FieldType.STRING,1/10/2016 11:00,1/10/2016 9:41,10.0,0.0,10.0,0.0,0.0,0.0,,,,,,,,,,,,,,
IUCR,FieldType.INTEGER,460,5002,10.0,0.0,10.0,0.0,0.0,0.0,460.0,478.5,460.0,610.0,715.0,1754.0,5002.0,5002.0,5002.0,1565.0,1696.33,2877530.0,1.17219,-0.479289
Arrest,FieldType.STRING,FALSE,TRUE,10.0,0.0,10.0,0.0,0.0,0.0,,,,,,,,,,,,,,
Latitude,FieldType.STRING,,41.95388599,10.0,0.0,10.0,0.0,0.0,2.0,,,,,,,,,,,,,,
Longitude,FieldType.STRING,,-87.73565764,10.0,0.0,10.0,0.0,0.0,2.0,,,,,,,,,,,,,,
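
To see why a numeric type matters, here is a minimal plain-Python sketch (it does not use Data Prep). It takes only the five "IUCR" values visible in the `head(5)` output above, so its numbers will not match the ten-row profile. The variable names are illustrative. As strings, the values compare character by character; once converted to integers, numeric summaries become meaningful.

```python
import statistics

# The five "IUCR" values from the head(5) output, as read from CSV (all strings).
iucr_as_strings = ['610', '1754', '5002', '497', '890']

# String comparison is lexicographic, so "max" is misleading before conversion.
max_as_string = max(iucr_as_strings)

# After converting to integers, numeric summaries behave as expected.
iucr_as_ints = [int(value) for value in iucr_as_strings]
max_as_int = max(iucr_as_ints)
mean_value = statistics.mean(iucr_as_ints)
median_value = statistics.median(iucr_as_ints)

print(max_as_string, max_as_int, mean_value, median_value)
```

The string "maximum" is '890' because '8' sorts after '5', while the integer maximum is 5002. This is the kind of difference that the numeric metrics in the profile depend on.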


### <a name="double"></a>Converting to double (floating point or decimal number)

Suppose the "Latitude" and "Longitude" columns should only contain decimal numbers. You can call `to_number` to convert the column type of "Latitude" and "Longitude" to `FieldType.DECIMAL`. In the data profile, you will see numeric metrics populated for these columns as well. Note that after the conversion, the profile reports two missing values in each of these columns (they were counted as empty strings before). Metrics like this can help you notice issues with the data set.

In [6]:
dflow_conversion = dflow_conversion.to_number(['Latitude', 'Longitude'])
profile = dflow_conversion.get_profile()
profile

Unnamed: 0,Type,Min,Max,Count,Missing Count,Not Missing Count,Percent missing,Error Count,Empty count,0.1% Quantile,1% Quantile,5% Quantile,25% Quantile,50% Quantile,75% Quantile,95% Quantile,99% Quantile,99.9% Quantile,Mean,Standard Deviation,Variance,Skewness,Kurtosis
Case Number,FieldType.STRING,HZ110403,HZ138745,10.0,0.0,10.0,0.0,0.0,0.0,,,,,,,,,,,,,,
Date,FieldType.STRING,1/10/2016 11:00,1/10/2016 9:41,10.0,0.0,10.0,0.0,0.0,0.0,,,,,,,,,,,,,,
IUCR,FieldType.INTEGER,460,5002,10.0,0.0,10.0,0.0,0.0,0.0,460.0,478.5,460.0,610.0,715.0,1754.0,5002.0,5002.0,5002.0,1565.0,1696.33,2877530.0,1.17219,-0.479289
Arrest,FieldType.STRING,FALSE,TRUE,10.0,0.0,10.0,0.0,0.0,0.0,,,,,,,,,,,,,,
Latitude,FieldType.DECIMAL,41.7289,41.9539,10.0,2.0,8.0,0.2,0.0,0.0,41.7289,41.7354,41.7289,41.7555,41.7912,41.8772,41.9539,41.9539,41.9539,41.8163,0.0809571,0.00655405,0.582896,-1.40269
Longitude,FieldType.DECIMAL,-87.7357,-87.5707,10.0,2.0,8.0,0.2,0.0,0.0,-87.7357,-87.7282,-87.7357,-87.7071,-87.689,-87.6257,-87.5707,-87.5707,-87.5707,-87.6688,0.056205,0.003159,0.505993,-1.37943
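
The shift from "Empty count" 2 to "Missing Count" 2 can be illustrated with a small plain-Python sketch. It assumes, purely for illustration, that an empty string becomes a missing value (`None`) on conversion. `to_float_or_none` is a hypothetical helper, not a Data Prep function, and the two empty entries are placeholders standing in for the empty cells that the profile reports.

```python
from typing import Optional


def to_float_or_none(text: str) -> Optional[float]:
    """Hypothetical helper: parse a decimal string, mapping blanks to None."""
    stripped = text.strip()
    if not stripped:
        return None
    return float(stripped)


# Three latitudes from the head(5) output plus two illustrative empty cells.
latitude_strings = ['41.95388599', '', '41.79319349', '', '41.91705356']

latitudes = [to_float_or_none(text) for text in latitude_strings]
missing_count = sum(value is None for value in latitudes)
present = [value for value in latitudes if value is not None]

print(missing_count, min(present), max(present))
```

While the column is a string column, an empty cell is just another string. Only after conversion can it be recognized as a missing number and excluded from statistics such as the minimum and maximum.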


### <a name="boolean"></a>Converting to boolean

Suppose the "Arrest" column should only contain boolean values. You can call `to_bool` to convert the column type of "Arrest" to `FieldType.BOOLEAN`.

The `to_bool` function allows you to specify which values should map to `True` and which values should map to `False`. To do so, pass those values as lists through the `true_values` and `false_values` parameters. Additionally, you can specify whether all other (unmatched) values should become `True` or `False` by using the `mismatch_as` parameter.

In [7]:
dflow_conversion.to_bool('Arrest', 
                         true_values=[1],
                         false_values=[0],
                         mismatch_as=dprep.MismatchAsOption.ASTRUE).head(5)

Unnamed: 0,Case Number,Date,IUCR,Arrest,Latitude,Longitude
0,HZ114126,1/10/2016 11:00,610,True,41.953886,-87.71077
1,HZ118288,1/10/2016 21:00,1754,True,41.793193,-87.696229
2,HZ110730,1/10/2016 11:50,5002,True,41.917054,-87.735658
3,HZ110403,1/10/2016 1:30,497,True,41.760392,-87.681805
4,HZ110836,1/10/2016 7:30,890,True,41.750687,-87.611277


In the previous conversion, all the values in the "Arrest" column became `True`, because the strings 'TRUE' and 'FALSE' matched neither `true_values` nor `false_values`, and all unmatched values were set to become `True`. Let's try the conversion again, this time adding 'TRUE' to `true_values` and 'FALSE' to `false_values`.

In [8]:
dflow_conversion = dflow_conversion.to_bool('Arrest',
                                            true_values=[1, 'TRUE'],
                                            false_values=[0, 'FALSE'],
                                            mismatch_as=dprep.MismatchAsOption.ASTRUE)
dflow_conversion.head(5)

Unnamed: 0,Case Number,Date,IUCR,Arrest,Latitude,Longitude
0,HZ114126,1/10/2016 11:00,610,True,41.953886,-87.71077
1,HZ118288,1/10/2016 21:00,1754,False,41.793193,-87.696229
2,HZ110730,1/10/2016 11:50,5002,False,41.917054,-87.735658
3,HZ110403,1/10/2016 1:30,497,False,41.760392,-87.681805
4,HZ110836,1/10/2016 7:30,890,False,41.750687,-87.611277


This time, all the string values 'FALSE' have been successfully converted to the boolean value `False`. Take another look at the data profile.

In [9]:
profile = dflow_conversion.get_profile()
profile

Unnamed: 0,Type,Min,Max,Count,Missing Count,Not Missing Count,Percent missing,Error Count,Empty count,0.1% Quantile,1% Quantile,5% Quantile,25% Quantile,50% Quantile,75% Quantile,95% Quantile,99% Quantile,99.9% Quantile,Mean,Standard Deviation,Variance,Skewness,Kurtosis
Case Number,FieldType.STRING,HZ110403,HZ138745,10.0,0.0,10.0,0.0,0.0,0.0,,,,,,,,,,,,,,
Date,FieldType.STRING,1/10/2016 11:00,1/10/2016 9:41,10.0,0.0,10.0,0.0,0.0,0.0,,,,,,,,,,,,,,
IUCR,FieldType.INTEGER,460,5002,10.0,0.0,10.0,0.0,0.0,0.0,460.0,478.5,460.0,610.0,715.0,1754.0,5002.0,5002.0,5002.0,1565.0,1696.33,2877530.0,1.17219,-0.479289
Arrest,FieldType.BOOLEAN,False,True,10.0,0.0,10.0,0.0,0.0,0.0,,,,,,,,,,,,,,
Latitude,FieldType.DECIMAL,41.7289,41.9539,10.0,2.0,8.0,0.2,0.0,0.0,41.7289,41.7354,41.7289,41.7555,41.7912,41.8772,41.9539,41.9539,41.9539,41.8163,0.0809571,0.00655405,0.582896,-1.40269
Longitude,FieldType.DECIMAL,-87.7357,-87.5707,10.0,2.0,8.0,0.2,0.0,0.0,-87.7357,-87.7282,-87.7357,-87.7071,-87.689,-87.6257,-87.5707,-87.5707,-87.5707,-87.6688,0.056205,0.003159,0.505993,-1.37943
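
The matching behaviour seen in the two `to_bool` calls above can be mimicked in plain Python. This is only a sketch of the idea, not how Data Prep implements it. `to_bool_sketch` is a hypothetical helper, and the input assumes the raw "Arrest" values are the strings 'TRUE' and 'FALSE', as the Min and Max of the string profile suggest.

```python
def to_bool_sketch(value, true_values, false_values, mismatch_as):
    """Hypothetical helper: map a value to a bool, with a fallback for mismatches."""
    if value in true_values:
        return True
    if value in false_values:
        return False
    return mismatch_as


# Assumed raw values for the first five rows of the "Arrest" column.
arrest_strings = ['TRUE', 'FALSE', 'FALSE', 'FALSE', 'FALSE']

# First attempt: only 1 and 0 are recognized, so every string is a mismatch.
first_attempt = [
    to_bool_sketch(value, [1], [0], mismatch_as=True)
    for value in arrest_strings
]

# Second attempt: the strings 'TRUE' and 'FALSE' are listed explicitly.
second_attempt = [
    to_bool_sketch(value, [1, 'TRUE'], [0, 'FALSE'], mismatch_as=True)
    for value in arrest_strings
]

print(first_attempt)   # [True, True, True, True, True]
print(second_attempt)  # [True, False, False, False, False]
```

The fallback is what made the first attempt look plausible but wrong, so it is worth checking the converted column against the raw values whenever a mismatch default is used.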


### <a name="datetime"></a>Converting to datetime

Suppose the "Date" column should only contain datetime values. You can convert its column type to `FieldType.DATE` using the `to_datetime` function. Datetime formats are often confusing or inconsistent, so the examples below show the tools that help you convert the column correctly.

In the first example, directly call `to_datetime` with only the column name. Data Prep will inspect the data in this column and learn what format should be used for the conversion.

Note that if there is data in the column that cannot be converted to datetime, an Error value will be created in that cell.

In [10]:
dflow_conversion_date = dflow_conversion.to_datetime('Date')
dflow_conversion_date.head(5)

Unnamed: 0,Case Number,Date,IUCR,Arrest,Latitude,Longitude
0,HZ114126,2016-01-10 11:00:00,610,True,41.953886,-87.71077
1,HZ118288,2016-01-10 21:00:00,1754,False,41.793193,-87.696229
2,HZ110730,2016-01-10 11:50:00,5002,False,41.917054,-87.735658
3,HZ110403,2016-01-10 01:30:00,497,False,41.760392,-87.681805
4,HZ110836,2016-01-10 07:30:00,890,False,41.750687,-87.611277


In this case, we can see that '1/10/2016 11:00' was converted using the format `%m/%d/%Y %H:%M`.

The data in this column is actually somewhat ambiguous. Should the dates be 'October 1' or 'January 10'? The function `to_datetime` determines that both are possible, but defaults to month-first (US format).

If the data was supposed to be day-first, you can customize the conversion.

In [11]:
dflow_alternate_conversion = dflow_conversion.to_datetime('Date', date_time_formats=['%d/%m/%Y %H:%M'])
dflow_alternate_conversion.head(5)

Unnamed: 0,Case Number,Date,IUCR,Arrest,Latitude,Longitude
0,HZ114126,2016-10-01 11:00:00,610,True,41.953886,-87.71077
1,HZ118288,2016-10-01 21:00:00,1754,False,41.793193,-87.696229
2,HZ110730,2016-10-01 11:50:00,5002,False,41.917054,-87.735658
3,HZ110403,2016-10-01 01:30:00,497,False,41.760392,-87.681805
4,HZ110836,2016-10-01 07:30:00,890,False,41.750687,-87.611277
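
The same ambiguity can be reproduced outside Data Prep with the standard library's `datetime.strptime`, which uses the same `%m`, `%d`, `%Y`, `%H`, and `%M` format codes. The sketch below parses the first "Date" value with both formats. The string '13/10/2016 11:00' is a made-up value, added only to show a case that is not ambiguous.

```python
from datetime import datetime

raw_value = '1/10/2016 11:00'

# The same text is valid under both interpretations.
month_first = datetime.strptime(raw_value, '%m/%d/%Y %H:%M')
day_first = datetime.strptime(raw_value, '%d/%m/%Y %H:%M')

print(month_first)  # 2016-01-10 11:00:00
print(day_first)    # 2016-10-01 11:00:00

# A hypothetical value whose first number exceeds 12 cannot be month-first.
try:
    datetime.strptime('13/10/2016 11:00', '%m/%d/%Y %H:%M')
    parsed_month_first = True
except ValueError:
    parsed_month_first = False

print(parsed_month_first)  # False
```

When every value in a column has a first number of 12 or less, as in this data set, nothing in the data itself can settle the question, so the format has to come from knowledge of the source.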


## <a name="builder"></a>Using `ColumnTypesBuilder`

Data Prep can automatically detect the likely type of each column.

You can call `dflow.builders.set_column_types()` to get a `ColumnTypesBuilder`. Then, calling `learn()` on it will trigger Data Prep to inspect the data in each column. As a result, you can see the suggested column types for each column (conversion candidates).

In [12]:
builder = dflow.builders.set_column_types()
builder.learn()
builder

Column types conversion candidates:
'Latitude': [FieldType.DECIMAL],
'IUCR': [FieldType.DECIMAL],
'Date': [
    (FieldType.DATE, ['%d/%m/%Y %H:%M']),
    (FieldType.DATE, ['%m/%d/%Y %H:%M'])],
'Case Number': [FieldType.STRING],
'Longitude': [FieldType.DECIMAL],
'Arrest': [FieldType.BOOLEAN]

In this case, Data Prep suggested the correct column types for "Arrest", "Case Number", "Latitude", and "Longitude".

However, for "Date", it has suggested two possible date formats: month-first, or day-first. The ambiguity must be resolved before you complete the conversion. To use the month-first format, you can call `builder.ambiguous_date_conversions_keep_month_day()`. Otherwise, call `builder.ambiguous_date_conversions_keep_day_month()`. Note that if there were multiple datetime columns with ambiguous date conversions, calling one of these functions will apply the resolution to all of them.

If you want to skip all the ambiguous date column conversions instead, you can call: `builder.ambiguous_date_conversions_drop()`

In [13]:
builder.ambiguous_date_conversions_keep_month_day()
builder.conversion_candidates

{'Arrest': [FieldType.BOOLEAN],
 'Case Number': [FieldType.STRING],
 'Date': [(FieldType.DATE, ['%m/%d/%Y %H:%M'])],
 'IUCR': [FieldType.DECIMAL],
 'Latitude': [FieldType.DECIMAL],
 'Longitude': [FieldType.DECIMAL]}

The conversion candidate for "IUCR" is currently `FieldType.DECIMAL`. If you already know that "IUCR" should be integers, you can tweak the builder to change the conversion candidate for that specific column.

In [14]:
builder.conversion_candidates['IUCR'] = dprep.FieldType.INTEGER
builder

Column types conversion candidates:
'Latitude': [FieldType.DECIMAL],
'IUCR': <FieldType.INTEGER: 2>,
'Date': [(FieldType.DATE, ['%m/%d/%Y %H:%M'])],
'Case Number': [FieldType.STRING],
'Longitude': [FieldType.DECIMAL],
'Arrest': [FieldType.BOOLEAN]

Once you are happy with the conversion candidates, you can complete the conversion by calling `builder.to_dataflow()`.

In [15]:
dflow_conversion_using_builder = builder.to_dataflow()
dflow_conversion_using_builder.head(5)

Unnamed: 0,Case Number,Date,IUCR,Arrest,Latitude,Longitude
0,HZ114126,2016-01-10 11:00:00,610,False,41.953886,-87.71077
1,HZ118288,2016-01-10 21:00:00,1754,False,41.793193,-87.696229
2,HZ110730,2016-01-10 11:50:00,5002,False,41.917054,-87.735658
3,HZ110403,2016-01-10 01:30:00,497,False,41.760392,-87.681805
4,HZ110836,2016-01-10 07:30:00,890,False,41.750687,-87.611277
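
To build intuition for what "inspecting the data in each column" might involve, here is a rough plain-Python sketch of candidate inference. It is not Data Prep's algorithm. `all_parse`, `suggest_candidates`, the type labels, and the list of date formats are all assumptions made for illustration. The sample values come from the first two rows of the data, with "Arrest" assumed to hold the raw strings 'TRUE' and 'FALSE'.

```python
from datetime import datetime

# Date formats tried for date-like strings (an assumption for this sketch).
DATE_FORMATS = ['%d/%m/%Y %H:%M', '%m/%d/%Y %H:%M']


def all_parse(values, parser):
    """Return True if the parser succeeds on every value."""
    try:
        for value in values:
            parser(value)
    except ValueError:
        return False
    return True


def suggest_candidates(values):
    """Hypothetical helper: list plausible types for a column of strings."""
    if all_parse(values, float):
        return ['decimal']
    if all(value.upper() in ('TRUE', 'FALSE') for value in values):
        return ['boolean']
    date_candidates = [
        ('date', fmt)
        for fmt in DATE_FORMATS
        if all_parse(values, lambda text, fmt=fmt: datetime.strptime(text, fmt))
    ]
    return date_candidates or ['string']


columns = {
    'Case Number': ['HZ114126', 'HZ118288'],
    'Date': ['1/10/2016 11:00', '1/10/2016 21:00'],
    'Arrest': ['TRUE', 'FALSE'],
    'Latitude': ['41.95388599', '41.79319349'],
}
candidates = {name: suggest_candidates(vals) for name, vals in columns.items()}

# Resolve ambiguous dates by keeping only the month-first format.
resolved = {
    name: [c for c in options if c[1].startswith('%m')] if len(options) > 1 else options
    for name, options in candidates.items()
}
print(resolved)
```

In this sketch, "Date" starts with two candidates because both formats parse every sample value, and the final step plays the role of choosing month-first for every ambiguous column at once.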


## <a name="multiple-columns"></a>Convert column types for multiple columns

If you already know the column types, you can simply call `dflow.set_column_types()`. This function allows you to specify multiple columns, and the desired column type for each one. Here's how you can convert all five columns at once.

Note that `set_column_types` only supports a subset of column type conversions. For example, you cannot specify the true/false values for a boolean conversion, so the result of this operation is incorrect for the "Arrest" column.

In [16]:
dflow_conversion_using_set = dflow.set_column_types({
    'IUCR': dprep.FieldType.INTEGER,
    'Latitude': dprep.FieldType.DECIMAL,
    'Longitude': dprep.FieldType.DECIMAL,
    'Arrest': dprep.FieldType.BOOLEAN,
    'Date': (dprep.FieldType.DATE, ['%m/%d/%Y %H:%M']),
})
dflow_conversion_using_set.head(5)

Unnamed: 0,Case Number,Date,IUCR,Arrest,Latitude,Longitude
0,HZ114126,2016-01-10 11:00:00,610,False,41.953886,-87.71077
1,HZ118288,2016-01-10 21:00:00,1754,False,41.793193,-87.696229
2,HZ110730,2016-01-10 11:50:00,5002,False,41.917054,-87.735658
3,HZ110403,2016-01-10 01:30:00,497,False,41.760392,-87.681805
4,HZ110836,2016-01-10 07:30:00,890,False,41.750687,-87.611277
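
The idea of mapping each column name to a target type can be sketched in plain Python as a dictionary of converter functions applied row by row. This is an illustration only, not the Data Prep API. `convert_row` and the `converters` mapping are hypothetical names, and "Arrest" is deliberately left out because a boolean conversion needs explicit true and false values.

```python
from datetime import datetime

# Column name -> converter (an illustrative mapping, not Data Prep's API).
converters = {
    'IUCR': int,
    'Latitude': float,
    'Longitude': float,
    'Date': lambda text: datetime.strptime(text, '%m/%d/%Y %H:%M'),
}

# The first two rows of the data set, with every value still a string.
rows = [
    {'Case Number': 'HZ114126', 'Date': '1/10/2016 11:00', 'IUCR': '610',
     'Latitude': '41.95388599', 'Longitude': '-87.71077048'},
    {'Case Number': 'HZ118288', 'Date': '1/10/2016 21:00', 'IUCR': '1754',
     'Latitude': '41.79319349', 'Longitude': '-87.69622926'},
]


def convert_row(row, converters):
    """Apply each column's converter; columns without one stay as strings."""
    return {
        name: converters[name](value) if name in converters else value
        for name, value in row.items()
    }


converted = [convert_row(row, converters) for row in rows]
print(converted[0])
```

Columns that have no entry in the mapping, such as "Case Number" here, pass through unchanged.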
