# Column Manipulations
Copyright (c) Microsoft Corporation. All rights reserved.<br>
Licensed under the MIT License.<br>

Azure ML Data Prep provides many methods for manipulating columns, from basic create, update, and delete (CUD) operations to more complex manipulations.

This notebook will focus primarily on data-agnostic operations. For all other column manipulation operations, we will link to their specific how-to guide.

## Table of Contents
[ColumnSelector](#ColumnSelector)<br>
[add_column](#add_column)<br>
[append_columns](#append_columns)<br>
[drop_columns](#drop_columns)<br>
[duplicate_column](#duplicate_column)<br>
[fuzzy_group_column](#fuzzy_group_column)<br>
[keep_columns](#keep_columns)<br>
[map_column](#map_column)<br>
[new_script_column](#new_script_column)<br>
[rename_columns](#rename_columns)<br>


## ColumnSelector
`ColumnSelector` is a Data Prep class that allows us to select columns by name. The idea is to describe columns generally rather than list them explicitly, using a search term or regular expression together with several matching options.

Note that a `ColumnSelector` does not hold the columns it matches; it is only a description of which columns to select. Therefore, if we use the same `ColumnSelector` on two different dataflows, we may get different results depending on the columns of each dataflow.
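
To make the distinction concrete, here is a minimal plain-Python sketch of a reusable selector. It is only a model of the idea, not Data Prep's implementation: `make_selector` is a hypothetical helper, and the second column list is an invented example of another dataflow.

```python
# Plain-Python model of a reusable selector; NOT the Data Prep implementation.
def make_selector(term):
    """Return a function that picks the column names containing `term` (case-insensitive)."""
    needle = term.lower()
    return lambda columns: [name for name in columns if needle in name.lower()]

# The selector is built once, without reference to any particular dataflow.
select_i = make_selector("i")

crime_columns = ["ID", "Case Number", "Date", "Primary Type"]
weather_columns = ["Station", "Date", "Precipitation", "Wind"]  # hypothetical second dataflow

print(select_i(crime_columns))    # ['ID', 'Primary Type']
print(select_i(weather_columns))  # ['Station', 'Precipitation', 'Wind']
```

The same selector yields different columns because it is evaluated against whatever column names it is given.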

Column manipulations that can use `ColumnSelector` are noted in their respective sections of this notebook.

In [1]:
from azureml.dataprep import auto_read_file
dflow = auto_read_file(path='../data/crime-dirty.csv')
dflow.head(5)

Unnamed: 0,ID,Case Number,Date,Block,IUCR,Primary Type,Description,Location Description,Arrest,Domestic,...,Ward,Community Area,FBI Code,X Coordinate,Y Coordinate,Year,Updated On,Latitude,Longitude,Location
0,10140490.0,HY329907,07/05/2015 11:50:00 PM,050XX N NEWLAND AVE,820.0,THEFT,$500 AND UNDER,STREET,False,False,...,41.0,10.0,06,1129230.0,1933315.0,2015.0,07/12/2015 12:42:46 PM,41.973309,-87.800175,"(41.973309466, -87.800174996)"
1,10139776.0,HY329265,07/05/2015 11:30:00 PM,011XX W MORSE AVE,460.0,BATTERY,SIMPLE,STREET,False,True,...,49.0,1.0,08B,1167370.0,1946271.0,2015.0,07/12/2015 12:42:46 PM,42.008124,-87.65955,"(42.008124017, -87.65955018)"
2,10140270.0,HY329253,07/05/2015 11:20:00 PM,121XX S FRONT AVE,486.0,BATTERY,DOMESTIC BATTERY SIMPLE,STREET,False,True,...,9.0,53.0,08B,,,2015.0,07/12/2015 12:42:46 PM,,,
3,10139885.0,HY329308,07/05/2015 11:19:00 PM,051XX W DIVISION ST,610.0,BURGLARY,FORCIBLE ENTRY,SMALL RETAIL STORE,False,False,...,37.0,25.0,05,1141721.0,1907465.0,2015.0,07/12/2015 12:42:46 PM,41.902152,-87.754883,"(41.902152027, -87.754883404)"
4,10140379.0,HY329556,07/05/2015 11:00:00 PM,012XX W LAKE ST,930.0,MOTOR VEHICLE THEFT,THEFT/RECOVERY: AUTOMOBILE,STREET,False,False,...,27.0,28.0,07,1168413.0,1901632.0,2015.0,07/12/2015 12:42:46 PM,41.88561,-87.657009,"(41.885610142, -87.657008701)"


All parameters to a `ColumnSelector` are shown here for completeness. We will use `keep_columns` in our example, which will keep only the columns in the dataflow that we tell it to keep.

In the example below, we match all columns containing the letter 'i'. Because we set `ignore_case` to true and `match_whole_word` to false, any column whose name contains 'i' or 'I' is selected.

In [2]:
from azureml.dataprep import ColumnSelector
column_selector = ColumnSelector(term="i",
                                 use_regex=False,
                                 ignore_case=True,
                                 match_whole_word=False,
                                 invert=False)
dflow_selected = dflow.keep_columns(column_selector)
dflow_selected.head(5)

Unnamed: 0,ID,IUCR,Primary Type,Description,Location Description,Domestic,District,Community Area,FBI Code,X Coordinate,Y Coordinate,Latitude,Longitude,Location
0,10140490.0,820.0,THEFT,$500 AND UNDER,STREET,False,16.0,10.0,06,1129230.0,1933315.0,41.973309,-87.800175,"(41.973309466, -87.800174996)"
1,10139776.0,460.0,BATTERY,SIMPLE,STREET,True,24.0,1.0,08B,1167370.0,1946271.0,42.008124,-87.65955,"(42.008124017, -87.65955018)"
2,10140270.0,486.0,BATTERY,DOMESTIC BATTERY SIMPLE,STREET,True,,53.0,08B,,,,,
3,10139885.0,610.0,BURGLARY,FORCIBLE ENTRY,SMALL RETAIL STORE,False,15.0,25.0,05,1141721.0,1907465.0,41.902152,-87.754883,"(41.902152027, -87.754883404)"
4,10140379.0,930.0,MOTOR VEHICLE THEFT,THEFT/RECOVERY: AUTOMOBILE,STREET,False,12.0,28.0,07,1168413.0,1901632.0,41.88561,-87.657009,"(41.885610142, -87.657008701)"


If we set `invert` to true, we get the opposite result: every column that the previous selector did not match.

In [3]:
column_selector = ColumnSelector(term="i",
                                 use_regex=False,
                                 ignore_case=True,
                                 match_whole_word=False,
                                 invert=True)
dflow_selected = dflow.keep_columns(column_selector)
dflow_selected.head(5)

Unnamed: 0,Case Number,Date,Block,Arrest,Beat,Ward,Year,Updated On
0,HY329907,07/05/2015 11:50:00 PM,050XX N NEWLAND AVE,False,1613.0,41.0,2015.0,07/12/2015 12:42:46 PM
1,HY329265,07/05/2015 11:30:00 PM,011XX W MORSE AVE,False,2431.0,49.0,2015.0,07/12/2015 12:42:46 PM
2,HY329253,07/05/2015 11:20:00 PM,121XX S FRONT AVE,False,532.0,9.0,2015.0,07/12/2015 12:42:46 PM
3,HY329308,07/05/2015 11:19:00 PM,051XX W DIVISION ST,False,1531.0,37.0,2015.0,07/12/2015 12:42:46 PM
4,HY329556,07/05/2015 11:00:00 PM,012XX W LAKE ST,False,1215.0,27.0,2015.0,07/12/2015 12:42:46 PM


If we change the search term to 'I' and set `ignore_case` to false (making the match case-sensitive), we get only the handful of columns that contain an uppercase 'I'.

In [4]:
column_selector = ColumnSelector(term="I",
                                 use_regex=False,
                                 ignore_case=False,
                                 match_whole_word=False,
                                 invert=False)
dflow_selected = dflow.keep_columns(column_selector)
dflow_selected.head(5)

Unnamed: 0,ID,IUCR,FBI Code
0,10140490.0,820.0,06
1,10139776.0,460.0,08B
2,10140270.0,486.0,08B
3,10139885.0,610.0,05
4,10140379.0,930.0,07


And if we set `match_whole_word` to true, we get no results at all as there is no column called 'I'.

In [5]:
column_selector = ColumnSelector(term="I",
                                 use_regex=False,
                                 ignore_case=False,
                                 match_whole_word=True,
                                 invert=False)
dflow_selected = dflow.keep_columns(column_selector)
dflow_selected.head(5)

Finally, the `use_regex` flag determines whether the search term is treated as a regular expression. It can still be combined with the other options.

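Here we select all columns whose names begin with 'I'. Note that `ignore_case` is true in this cell, so a leading lowercase 'i' would match as well; no column in this dataset has one.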

In [6]:
column_selector = ColumnSelector(term="I.*",
                                 use_regex=True,
                                 ignore_case=True,
                                 match_whole_word=True,
                                 invert=False)
dflow_selected = dflow.keep_columns(column_selector)
dflow_selected.head(5)

Unnamed: 0,ID,IUCR
0,10140490.0,820.0
1,10139776.0,460.0
2,10140270.0,486.0
3,10139885.0,610.0
4,10140379.0,930.0
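
The five selector options can be summarized in a small plain-Python model that reproduces the column lists shown in the outputs above. This is a sketch inferred from those outputs, not Data Prep's actual matching code: `match_columns` is a hypothetical function, and it assumes that "whole word" means the term must cover the entire column name.

```python
import re

# Column names of the crime dataflow, in order.
COLUMNS = ["ID", "Case Number", "Date", "Block", "IUCR", "Primary Type",
           "Description", "Location Description", "Arrest", "Domestic", "Beat",
           "District", "Ward", "Community Area", "FBI Code", "X Coordinate",
           "Y Coordinate", "Year", "Updated On", "Latitude", "Longitude", "Location"]

def match_columns(columns, term, use_regex=False, ignore_case=False,
                  match_whole_word=False, invert=False):
    """Hypothetical model of the selector options, inferred from the outputs above."""
    # A plain search term is escaped so it is matched literally.
    pattern = term if use_regex else re.escape(term)
    flags = re.IGNORECASE if ignore_case else 0
    # Assumption: "whole word" means the pattern must cover the entire column name.
    matcher = re.fullmatch if match_whole_word else re.search
    # `invert` flips the outcome of every individual match.
    return [name for name in columns
            if (matcher(pattern, name, flags=flags) is not None) != invert]

print(match_columns(COLUMNS, "I"))                         # ['ID', 'IUCR', 'FBI Code']
print(match_columns(COLUMNS, "I", match_whole_word=True))  # []
print(match_columns(COLUMNS, "I.*", use_regex=True,
                    ignore_case=True, match_whole_word=True))  # ['ID', 'IUCR']
```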


## add_column

Please see [add-column-using-expression](add-column-using-expression.ipynb).

## append_columns

Please see [append-columns-and-rows](append-columns-and-rows.ipynb).

## drop_columns

Data Prep supports dropping one or more columns in a single statement. Supports `ColumnSelector`.

In [7]:
from azureml.dataprep import auto_read_file
dflow = auto_read_file(path='../data/crime-dirty.csv')
dflow.head(5)

Unnamed: 0,ID,Case Number,Date,Block,IUCR,Primary Type,Description,Location Description,Arrest,Domestic,...,Ward,Community Area,FBI Code,X Coordinate,Y Coordinate,Year,Updated On,Latitude,Longitude,Location
0,10140490.0,HY329907,07/05/2015 11:50:00 PM,050XX N NEWLAND AVE,820.0,THEFT,$500 AND UNDER,STREET,False,False,...,41.0,10.0,06,1129230.0,1933315.0,2015.0,07/12/2015 12:42:46 PM,41.973309,-87.800175,"(41.973309466, -87.800174996)"
1,10139776.0,HY329265,07/05/2015 11:30:00 PM,011XX W MORSE AVE,460.0,BATTERY,SIMPLE,STREET,False,True,...,49.0,1.0,08B,1167370.0,1946271.0,2015.0,07/12/2015 12:42:46 PM,42.008124,-87.65955,"(42.008124017, -87.65955018)"
2,10140270.0,HY329253,07/05/2015 11:20:00 PM,121XX S FRONT AVE,486.0,BATTERY,DOMESTIC BATTERY SIMPLE,STREET,False,True,...,9.0,53.0,08B,,,2015.0,07/12/2015 12:42:46 PM,,,
3,10139885.0,HY329308,07/05/2015 11:19:00 PM,051XX W DIVISION ST,610.0,BURGLARY,FORCIBLE ENTRY,SMALL RETAIL STORE,False,False,...,37.0,25.0,05,1141721.0,1907465.0,2015.0,07/12/2015 12:42:46 PM,41.902152,-87.754883,"(41.902152027, -87.754883404)"
4,10140379.0,HY329556,07/05/2015 11:00:00 PM,012XX W LAKE ST,930.0,MOTOR VEHICLE THEFT,THEFT/RECOVERY: AUTOMOBILE,STREET,False,False,...,27.0,28.0,07,1168413.0,1901632.0,2015.0,07/12/2015 12:42:46 PM,41.88561,-87.657009,"(41.885610142, -87.657008701)"


Note that there are 22 columns to begin with. We will now drop the 'ID' column and observe that the resulting dataflow contains 21 columns.

In [8]:
dflow_dropped = dflow.drop_columns('ID')
dflow_dropped.head(5)

Unnamed: 0,Case Number,Date,Block,IUCR,Primary Type,Description,Location Description,Arrest,Domestic,Beat,...,Ward,Community Area,FBI Code,X Coordinate,Y Coordinate,Year,Updated On,Latitude,Longitude,Location
0,HY329907,07/05/2015 11:50:00 PM,050XX N NEWLAND AVE,820.0,THEFT,$500 AND UNDER,STREET,False,False,1613.0,...,41.0,10.0,06,1129230.0,1933315.0,2015.0,07/12/2015 12:42:46 PM,41.973309,-87.800175,"(41.973309466, -87.800174996)"
1,HY329265,07/05/2015 11:30:00 PM,011XX W MORSE AVE,460.0,BATTERY,SIMPLE,STREET,False,True,2431.0,...,49.0,1.0,08B,1167370.0,1946271.0,2015.0,07/12/2015 12:42:46 PM,42.008124,-87.65955,"(42.008124017, -87.65955018)"
2,HY329253,07/05/2015 11:20:00 PM,121XX S FRONT AVE,486.0,BATTERY,DOMESTIC BATTERY SIMPLE,STREET,False,True,532.0,...,9.0,53.0,08B,,,2015.0,07/12/2015 12:42:46 PM,,,
3,HY329308,07/05/2015 11:19:00 PM,051XX W DIVISION ST,610.0,BURGLARY,FORCIBLE ENTRY,SMALL RETAIL STORE,False,False,1531.0,...,37.0,25.0,05,1141721.0,1907465.0,2015.0,07/12/2015 12:42:46 PM,41.902152,-87.754883,"(41.902152027, -87.754883404)"
4,HY329556,07/05/2015 11:00:00 PM,012XX W LAKE ST,930.0,MOTOR VEHICLE THEFT,THEFT/RECOVERY: AUTOMOBILE,STREET,False,False,1215.0,...,27.0,28.0,07,1168413.0,1901632.0,2015.0,07/12/2015 12:42:46 PM,41.88561,-87.657009,"(41.885610142, -87.657008701)"


We can also drop more than one column at once by passing a list of column names.

In [9]:
dflow_dropped = dflow_dropped.drop_columns(['IUCR', 'Description'])
dflow_dropped.head(5)

Unnamed: 0,Case Number,Date,Block,Primary Type,Location Description,Arrest,Domestic,Beat,District,Ward,Community Area,FBI Code,X Coordinate,Y Coordinate,Year,Updated On,Latitude,Longitude,Location
0,HY329907,07/05/2015 11:50:00 PM,050XX N NEWLAND AVE,THEFT,STREET,False,False,1613.0,16.0,41.0,10.0,06,1129230.0,1933315.0,2015.0,07/12/2015 12:42:46 PM,41.973309,-87.800175,"(41.973309466, -87.800174996)"
1,HY329265,07/05/2015 11:30:00 PM,011XX W MORSE AVE,BATTERY,STREET,False,True,2431.0,24.0,49.0,1.0,08B,1167370.0,1946271.0,2015.0,07/12/2015 12:42:46 PM,42.008124,-87.65955,"(42.008124017, -87.65955018)"
2,HY329253,07/05/2015 11:20:00 PM,121XX S FRONT AVE,BATTERY,STREET,False,True,532.0,,9.0,53.0,08B,,,2015.0,07/12/2015 12:42:46 PM,,,
3,HY329308,07/05/2015 11:19:00 PM,051XX W DIVISION ST,BURGLARY,SMALL RETAIL STORE,False,False,1531.0,15.0,37.0,25.0,05,1141721.0,1907465.0,2015.0,07/12/2015 12:42:46 PM,41.902152,-87.754883,"(41.902152027, -87.754883404)"
4,HY329556,07/05/2015 11:00:00 PM,012XX W LAKE ST,MOTOR VEHICLE THEFT,STREET,False,False,1215.0,12.0,27.0,28.0,07,1168413.0,1901632.0,2015.0,07/12/2015 12:42:46 PM,41.88561,-87.657009,"(41.885610142, -87.657008701)"
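
The two call forms above (a single name, then a list of names) can be modeled on a plain list of column names. This is an illustrative sketch only; the `drop_columns` function below is a hypothetical stand-in that operates on names and says nothing about how Data Prep executes the step.

```python
def drop_columns(columns, names):
    """Return `columns` without the given name(s); accepts one name or a list of names."""
    to_drop = {names} if isinstance(names, str) else set(names)
    return [name for name in columns if name not in to_drop]

columns = ["ID", "Case Number", "IUCR", "Primary Type", "Description"]

after_first = drop_columns(columns, "ID")                          # single name
after_second = drop_columns(after_first, ["IUCR", "Description"])  # list of names

print(after_first)   # ['Case Number', 'IUCR', 'Primary Type', 'Description']
print(after_second)  # ['Case Number', 'Primary Type']
```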


## duplicate_column

Data Prep supports duplicating one or more columns in a single statement.

Duplicated columns are placed to the immediate right of their source column.

In [10]:
from azureml.dataprep import auto_read_file
dflow = auto_read_file(path='../data/crime-dirty.csv')
dflow.head(5)

Unnamed: 0,ID,Case Number,Date,Block,IUCR,Primary Type,Description,Location Description,Arrest,Domestic,...,Ward,Community Area,FBI Code,X Coordinate,Y Coordinate,Year,Updated On,Latitude,Longitude,Location
0,10140490.0,HY329907,07/05/2015 11:50:00 PM,050XX N NEWLAND AVE,820.0,THEFT,$500 AND UNDER,STREET,False,False,...,41.0,10.0,06,1129230.0,1933315.0,2015.0,07/12/2015 12:42:46 PM,41.973309,-87.800175,"(41.973309466, -87.800174996)"
1,10139776.0,HY329265,07/05/2015 11:30:00 PM,011XX W MORSE AVE,460.0,BATTERY,SIMPLE,STREET,False,True,...,49.0,1.0,08B,1167370.0,1946271.0,2015.0,07/12/2015 12:42:46 PM,42.008124,-87.65955,"(42.008124017, -87.65955018)"
2,10140270.0,HY329253,07/05/2015 11:20:00 PM,121XX S FRONT AVE,486.0,BATTERY,DOMESTIC BATTERY SIMPLE,STREET,False,True,...,9.0,53.0,08B,,,2015.0,07/12/2015 12:42:46 PM,,,
3,10139885.0,HY329308,07/05/2015 11:19:00 PM,051XX W DIVISION ST,610.0,BURGLARY,FORCIBLE ENTRY,SMALL RETAIL STORE,False,False,...,37.0,25.0,05,1141721.0,1907465.0,2015.0,07/12/2015 12:42:46 PM,41.902152,-87.754883,"(41.902152027, -87.754883404)"
4,10140379.0,HY329556,07/05/2015 11:00:00 PM,012XX W LAKE ST,930.0,MOTOR VEHICLE THEFT,THEFT/RECOVERY: AUTOMOBILE,STREET,False,False,...,27.0,28.0,07,1168413.0,1901632.0,2015.0,07/12/2015 12:42:46 PM,41.88561,-87.657009,"(41.885610142, -87.657008701)"


We specify which column(s) to duplicate and what the new column name(s) should be with a dictionary that maps each source column name to the name of its duplicate.

In [11]:
dflow_dupe = dflow.duplicate_column({'ID': 'ID2', 'IUCR': 'IUCR_Clone'})
dflow_dupe.head(5)

Unnamed: 0,ID,ID2,Case Number,Date,Block,IUCR,IUCR_Clone,Primary Type,Description,Location Description,...,Ward,Community Area,FBI Code,X Coordinate,Y Coordinate,Year,Updated On,Latitude,Longitude,Location
0,10140490.0,10140490.0,HY329907,07/05/2015 11:50:00 PM,050XX N NEWLAND AVE,820.0,820.0,THEFT,$500 AND UNDER,STREET,...,41.0,10.0,06,1129230.0,1933315.0,2015.0,07/12/2015 12:42:46 PM,41.973309,-87.800175,"(41.973309466, -87.800174996)"
1,10139776.0,10139776.0,HY329265,07/05/2015 11:30:00 PM,011XX W MORSE AVE,460.0,460.0,BATTERY,SIMPLE,STREET,...,49.0,1.0,08B,1167370.0,1946271.0,2015.0,07/12/2015 12:42:46 PM,42.008124,-87.65955,"(42.008124017, -87.65955018)"
2,10140270.0,10140270.0,HY329253,07/05/2015 11:20:00 PM,121XX S FRONT AVE,486.0,486.0,BATTERY,DOMESTIC BATTERY SIMPLE,STREET,...,9.0,53.0,08B,,,2015.0,07/12/2015 12:42:46 PM,,,
3,10139885.0,10139885.0,HY329308,07/05/2015 11:19:00 PM,051XX W DIVISION ST,610.0,610.0,BURGLARY,FORCIBLE ENTRY,SMALL RETAIL STORE,...,37.0,25.0,05,1141721.0,1907465.0,2015.0,07/12/2015 12:42:46 PM,41.902152,-87.754883,"(41.902152027, -87.754883404)"
4,10140379.0,10140379.0,HY329556,07/05/2015 11:00:00 PM,012XX W LAKE ST,930.0,930.0,MOTOR VEHICLE THEFT,THEFT/RECOVERY: AUTOMOBILE,STREET,...,27.0,28.0,07,1168413.0,1901632.0,2015.0,07/12/2015 12:42:46 PM,41.88561,-87.657009,"(41.885610142, -87.657008701)"
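
The placement rule (each duplicate lands immediately to the right of its source) can be sketched on a list of column names. The helper `duplicate_columns` is hypothetical and models only the resulting column order, not the copied data.

```python
def duplicate_columns(columns, new_names):
    """Insert each duplicate directly after its source column."""
    result = []
    for name in columns:
        result.append(name)
        if name in new_names:
            # The copy goes immediately to the right of the source column.
            result.append(new_names[name])
    return result

columns = ["ID", "Case Number", "Date", "Block", "IUCR", "Primary Type"]
print(duplicate_columns(columns, {"ID": "ID2", "IUCR": "IUCR_Clone"}))
# ['ID', 'ID2', 'Case Number', 'Date', 'Block', 'IUCR', 'IUCR_Clone', 'Primary Type']
```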


## fuzzy_group_column

Please see [fuzzy-group](fuzzy-group.ipynb).

## keep_columns

Data Prep supports keeping one or more columns in a single statement. The resulting dataflow contains only the specified column(s); all other columns are dropped. Supports `ColumnSelector`.

In [12]:
from azureml.dataprep import auto_read_file
dflow = auto_read_file(path='../data/crime-dirty.csv')
dflow.head(5)

Unnamed: 0,ID,Case Number,Date,Block,IUCR,Primary Type,Description,Location Description,Arrest,Domestic,...,Ward,Community Area,FBI Code,X Coordinate,Y Coordinate,Year,Updated On,Latitude,Longitude,Location
0,10140490.0,HY329907,07/05/2015 11:50:00 PM,050XX N NEWLAND AVE,820.0,THEFT,$500 AND UNDER,STREET,False,False,...,41.0,10.0,06,1129230.0,1933315.0,2015.0,07/12/2015 12:42:46 PM,41.973309,-87.800175,"(41.973309466, -87.800174996)"
1,10139776.0,HY329265,07/05/2015 11:30:00 PM,011XX W MORSE AVE,460.0,BATTERY,SIMPLE,STREET,False,True,...,49.0,1.0,08B,1167370.0,1946271.0,2015.0,07/12/2015 12:42:46 PM,42.008124,-87.65955,"(42.008124017, -87.65955018)"
2,10140270.0,HY329253,07/05/2015 11:20:00 PM,121XX S FRONT AVE,486.0,BATTERY,DOMESTIC BATTERY SIMPLE,STREET,False,True,...,9.0,53.0,08B,,,2015.0,07/12/2015 12:42:46 PM,,,
3,10139885.0,HY329308,07/05/2015 11:19:00 PM,051XX W DIVISION ST,610.0,BURGLARY,FORCIBLE ENTRY,SMALL RETAIL STORE,False,False,...,37.0,25.0,05,1141721.0,1907465.0,2015.0,07/12/2015 12:42:46 PM,41.902152,-87.754883,"(41.902152027, -87.754883404)"
4,10140379.0,HY329556,07/05/2015 11:00:00 PM,012XX W LAKE ST,930.0,MOTOR VEHICLE THEFT,THEFT/RECOVERY: AUTOMOBILE,STREET,False,False,...,27.0,28.0,07,1168413.0,1901632.0,2015.0,07/12/2015 12:42:46 PM,41.88561,-87.657009,"(41.885610142, -87.657008701)"


In [13]:
dflow_keep = dflow.keep_columns(['ID', 'Date', 'Description'])
dflow_keep.head(5)

Unnamed: 0,ID,Date,Description
0,10140490.0,07/05/2015 11:50:00 PM,$500 AND UNDER
1,10139776.0,07/05/2015 11:30:00 PM,SIMPLE
2,10140270.0,07/05/2015 11:20:00 PM,DOMESTIC BATTERY SIMPLE
3,10139885.0,07/05/2015 11:19:00 PM,FORCIBLE ENTRY
4,10140379.0,07/05/2015 11:00:00 PM,THEFT/RECOVERY: AUTOMOBILE


Similar to `drop_columns`, we can pass a single column name or a list of them.

In [14]:
dflow_keep = dflow_keep.keep_columns('ID')
dflow_keep.head(5)

Unnamed: 0,ID
0,10140490.0
1,10139776.0
2,10140270.0
3,10139885.0
4,10140379.0
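
To show the effect on the records themselves, here is a small sketch that applies the same two steps to rows held as plain dictionaries. The `keep_columns` function is a hypothetical model, and the two rows are copied from the output above with only a few columns included.

```python
def keep_columns(rows, names):
    """Return rows containing only the given column(s); accepts one name or a list."""
    wanted = [names] if isinstance(names, str) else list(names)
    return [{name: value for name, value in row.items() if name in wanted}
            for row in rows]

rows = [
    {"ID": 10140490.0, "Date": "07/05/2015 11:50:00 PM",
     "Primary Type": "THEFT", "Description": "$500 AND UNDER"},
    {"ID": 10139776.0, "Date": "07/05/2015 11:30:00 PM",
     "Primary Type": "BATTERY", "Description": "SIMPLE"},
]

kept = keep_columns(rows, ["ID", "Date", "Description"])  # list of names
narrowed = keep_columns(kept, "ID")                       # single name

print(list(kept[0]))  # ['ID', 'Date', 'Description']
print(narrowed)       # [{'ID': 10140490.0}, {'ID': 10139776.0}]
```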


## map_column

Data Prep supports string mapping. For a column containing strings, we can provide specific mappings from an original value to a new value, and then produce a new column that contains the mapped values.

The mapped columns are placed to the immediate right of their source column.

In [15]:
from azureml.dataprep import auto_read_file
dflow = auto_read_file(path='../data/crime-dirty.csv')
dflow.head(5)

Unnamed: 0,ID,Case Number,Date,Block,IUCR,Primary Type,Description,Location Description,Arrest,Domestic,...,Ward,Community Area,FBI Code,X Coordinate,Y Coordinate,Year,Updated On,Latitude,Longitude,Location
0,10140490.0,HY329907,07/05/2015 11:50:00 PM,050XX N NEWLAND AVE,820.0,THEFT,$500 AND UNDER,STREET,False,False,...,41.0,10.0,06,1129230.0,1933315.0,2015.0,07/12/2015 12:42:46 PM,41.973309,-87.800175,"(41.973309466, -87.800174996)"
1,10139776.0,HY329265,07/05/2015 11:30:00 PM,011XX W MORSE AVE,460.0,BATTERY,SIMPLE,STREET,False,True,...,49.0,1.0,08B,1167370.0,1946271.0,2015.0,07/12/2015 12:42:46 PM,42.008124,-87.65955,"(42.008124017, -87.65955018)"
2,10140270.0,HY329253,07/05/2015 11:20:00 PM,121XX S FRONT AVE,486.0,BATTERY,DOMESTIC BATTERY SIMPLE,STREET,False,True,...,9.0,53.0,08B,,,2015.0,07/12/2015 12:42:46 PM,,,
3,10139885.0,HY329308,07/05/2015 11:19:00 PM,051XX W DIVISION ST,610.0,BURGLARY,FORCIBLE ENTRY,SMALL RETAIL STORE,False,False,...,37.0,25.0,05,1141721.0,1907465.0,2015.0,07/12/2015 12:42:46 PM,41.902152,-87.754883,"(41.902152027, -87.754883404)"
4,10140379.0,HY329556,07/05/2015 11:00:00 PM,012XX W LAKE ST,930.0,MOTOR VEHICLE THEFT,THEFT/RECOVERY: AUTOMOBILE,STREET,False,False,...,27.0,28.0,07,1168413.0,1901632.0,2015.0,07/12/2015 12:42:46 PM,41.88561,-87.657009,"(41.885610142, -87.657008701)"


In [16]:
from azureml.dataprep import ReplacementsValue
replacements = [ReplacementsValue('THEFT', 'THEFT2'), ReplacementsValue('BATTERY', 'BATTERY!!!')]
dflow_mapped = dflow.map_column(column='Primary Type', 
                                new_column_id='Primary Type V2',
                                replacements=replacements)
dflow_mapped.head(5)

Unnamed: 0,ID,Case Number,Date,Block,IUCR,Primary Type,Primary Type V2,Description,Location Description,Arrest,...,Ward,Community Area,FBI Code,X Coordinate,Y Coordinate,Year,Updated On,Latitude,Longitude,Location
0,10140490.0,HY329907,07/05/2015 11:50:00 PM,050XX N NEWLAND AVE,820.0,THEFT,THEFT2,$500 AND UNDER,STREET,False,...,41.0,10.0,06,1129230.0,1933315.0,2015.0,07/12/2015 12:42:46 PM,41.973309,-87.800175,"(41.973309466, -87.800174996)"
1,10139776.0,HY329265,07/05/2015 11:30:00 PM,011XX W MORSE AVE,460.0,BATTERY,BATTERY!!!,SIMPLE,STREET,False,...,49.0,1.0,08B,1167370.0,1946271.0,2015.0,07/12/2015 12:42:46 PM,42.008124,-87.65955,"(42.008124017, -87.65955018)"
2,10140270.0,HY329253,07/05/2015 11:20:00 PM,121XX S FRONT AVE,486.0,BATTERY,BATTERY!!!,DOMESTIC BATTERY SIMPLE,STREET,False,...,9.0,53.0,08B,,,2015.0,07/12/2015 12:42:46 PM,,,
3,10139885.0,HY329308,07/05/2015 11:19:00 PM,051XX W DIVISION ST,610.0,BURGLARY,BURGLARY,FORCIBLE ENTRY,SMALL RETAIL STORE,False,...,37.0,25.0,05,1141721.0,1907465.0,2015.0,07/12/2015 12:42:46 PM,41.902152,-87.754883,"(41.902152027, -87.754883404)"
4,10140379.0,HY329556,07/05/2015 11:00:00 PM,012XX W LAKE ST,930.0,MOTOR VEHICLE THEFT,MOTOR VEHICLE THEFT,THEFT/RECOVERY: AUTOMOBILE,STREET,False,...,27.0,28.0,07,1168413.0,1901632.0,2015.0,07/12/2015 12:42:46 PM,41.88561,-87.657009,"(41.885610142, -87.657008701)"
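
The output above suggests two properties of the mapping: values are matched as a whole ('MOTOR VEHICLE THEFT' is not rewritten even though it contains 'THEFT'), and values without a mapping pass through unchanged. A minimal plain-Python sketch of that behavior follows; the `map_column` function here is a hypothetical stand-in that uses tuples in place of `ReplacementsValue`.

```python
def map_column(values, replacements):
    """Map each value through `replacements`; values without a mapping pass through."""
    lookup = dict(replacements)  # source value -> target value
    # Whole-value lookup: a value is replaced only if it equals a source exactly.
    return [lookup.get(value, value) for value in values]

primary_type = ["THEFT", "BATTERY", "BATTERY", "BURGLARY", "MOTOR VEHICLE THEFT"]
replacements = [("THEFT", "THEFT2"), ("BATTERY", "BATTERY!!!")]

print(map_column(primary_type, replacements))
# ['THEFT2', 'BATTERY!!!', 'BATTERY!!!', 'BURGLARY', 'MOTOR VEHICLE THEFT']
```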


## new_script_column

Please see [custom-python-transforms](custom-python-transforms.ipynb).

## rename_columns

Data Prep supports renaming one or more columns in a single statement.

In [17]:
from azureml.dataprep import auto_read_file
dflow = auto_read_file(path='../data/crime-dirty.csv')
dflow.head(5)

Unnamed: 0,ID,Case Number,Date,Block,IUCR,Primary Type,Description,Location Description,Arrest,Domestic,...,Ward,Community Area,FBI Code,X Coordinate,Y Coordinate,Year,Updated On,Latitude,Longitude,Location
0,10140490.0,HY329907,07/05/2015 11:50:00 PM,050XX N NEWLAND AVE,820.0,THEFT,$500 AND UNDER,STREET,False,False,...,41.0,10.0,06,1129230.0,1933315.0,2015.0,07/12/2015 12:42:46 PM,41.973309,-87.800175,"(41.973309466, -87.800174996)"
1,10139776.0,HY329265,07/05/2015 11:30:00 PM,011XX W MORSE AVE,460.0,BATTERY,SIMPLE,STREET,False,True,...,49.0,1.0,08B,1167370.0,1946271.0,2015.0,07/12/2015 12:42:46 PM,42.008124,-87.65955,"(42.008124017, -87.65955018)"
2,10140270.0,HY329253,07/05/2015 11:20:00 PM,121XX S FRONT AVE,486.0,BATTERY,DOMESTIC BATTERY SIMPLE,STREET,False,True,...,9.0,53.0,08B,,,2015.0,07/12/2015 12:42:46 PM,,,
3,10139885.0,HY329308,07/05/2015 11:19:00 PM,051XX W DIVISION ST,610.0,BURGLARY,FORCIBLE ENTRY,SMALL RETAIL STORE,False,False,...,37.0,25.0,05,1141721.0,1907465.0,2015.0,07/12/2015 12:42:46 PM,41.902152,-87.754883,"(41.902152027, -87.754883404)"
4,10140379.0,HY329556,07/05/2015 11:00:00 PM,012XX W LAKE ST,930.0,MOTOR VEHICLE THEFT,THEFT/RECOVERY: AUTOMOBILE,STREET,False,False,...,27.0,28.0,07,1168413.0,1901632.0,2015.0,07/12/2015 12:42:46 PM,41.88561,-87.657009,"(41.885610142, -87.657008701)"


We specify which column(s) to rename and what the new column name(s) should be with a dictionary that maps each current column name to its new name.

In [18]:
dflow_renamed = dflow.rename_columns({'ID': 'ID2', 'IUCR': 'IUCR_Clone'})
dflow_renamed.head(5)

Unnamed: 0,ID2,Case Number,Date,Block,IUCR_Clone,Primary Type,Description,Location Description,Arrest,Domestic,...,Ward,Community Area,FBI Code,X Coordinate,Y Coordinate,Year,Updated On,Latitude,Longitude,Location
0,10140490.0,HY329907,07/05/2015 11:50:00 PM,050XX N NEWLAND AVE,820.0,THEFT,$500 AND UNDER,STREET,False,False,...,41.0,10.0,06,1129230.0,1933315.0,2015.0,07/12/2015 12:42:46 PM,41.973309,-87.800175,"(41.973309466, -87.800174996)"
1,10139776.0,HY329265,07/05/2015 11:30:00 PM,011XX W MORSE AVE,460.0,BATTERY,SIMPLE,STREET,False,True,...,49.0,1.0,08B,1167370.0,1946271.0,2015.0,07/12/2015 12:42:46 PM,42.008124,-87.65955,"(42.008124017, -87.65955018)"
2,10140270.0,HY329253,07/05/2015 11:20:00 PM,121XX S FRONT AVE,486.0,BATTERY,DOMESTIC BATTERY SIMPLE,STREET,False,True,...,9.0,53.0,08B,,,2015.0,07/12/2015 12:42:46 PM,,,
3,10139885.0,HY329308,07/05/2015 11:19:00 PM,051XX W DIVISION ST,610.0,BURGLARY,FORCIBLE ENTRY,SMALL RETAIL STORE,False,False,...,37.0,25.0,05,1141721.0,1907465.0,2015.0,07/12/2015 12:42:46 PM,41.902152,-87.754883,"(41.902152027, -87.754883404)"
4,10140379.0,HY329556,07/05/2015 11:00:00 PM,012XX W LAKE ST,930.0,MOTOR VEHICLE THEFT,THEFT/RECOVERY: AUTOMOBILE,STREET,False,False,...,27.0,28.0,07,1168413.0,1901632.0,2015.0,07/12/2015 12:42:46 PM,41.88561,-87.657009,"(41.885610142, -87.657008701)"
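
Note that the same dictionary passed to `duplicate_column` earlier adds new columns, whereas here it replaces names in place, so the column count and order stay the same. A minimal sketch on a list of column names, using a hypothetical `rename_columns` helper rather than Data Prep itself:

```python
def rename_columns(columns, new_names):
    """Replace each name found in `new_names`; other names and the order are unchanged."""
    return [new_names.get(name, name) for name in columns]

columns = ["ID", "Case Number", "Date", "Block", "IUCR", "Primary Type"]
renamed = rename_columns(columns, {"ID": "ID2", "IUCR": "IUCR_Clone"})

print(renamed)
# ['ID2', 'Case Number', 'Date', 'Block', 'IUCR_Clone', 'Primary Type']
print(len(renamed) == len(columns))  # True
```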
