# Filtering
Copyright (c) Microsoft Corporation. All rights reserved.<br>
Licensed under the MIT License.

Azure ML Data Prep can filter out columns with `Dataflow.drop_columns` and rows with `Dataflow.filter`.

In [1]:
# initial set up
import azureml.dataprep as dprep
from datetime import datetime
dflow = dprep.read_csv(path='../data/crime-spring.csv')
dflow.head(5)

Unnamed: 0,ID,Case Number,Date,Block,IUCR,Primary Type,Description,Location Description,Arrest,Domestic,...,Ward,Community Area,FBI Code,X Coordinate,Y Coordinate,Year,Updated On,Latitude,Longitude,Location
0,10498554,HZ239907,4/15/2016 23:56,007XX E 111TH ST,1153,DECEPTIVE PRACTICE,FINANCIAL IDENTITY THEFT OVER $ 300,OTHER,False,False,...,9,50,11,1183356.0,1831503.0,2016,5/11/2016 15:48,41.69283384,-87.60431945,"(41.692833841, -87.60431945)"
1,10516598,HZ258664,4/15/2016 17:00,082XX S MARSHFIELD AVE,890,THEFT,FROM BUILDING,RESIDENCE,False,False,...,21,71,6,1166776.0,1850053.0,2016,5/12/2016 15:48,41.74410697,-87.66449429,"(41.744106973, -87.664494285)"
2,10519196,HZ261252,4/15/2016 10:00,104XX S SACRAMENTO AVE,1154,DECEPTIVE PRACTICE,FINANCIAL IDENTITY THEFT $300 AND UNDER,RESIDENCE,False,False,...,19,74,11,,,2016,5/12/2016 15:50,,,
3,10519591,HZ261534,4/15/2016 9:00,113XX S PRAIRIE AVE,1120,DECEPTIVE PRACTICE,FORGERY,RESIDENCE,False,False,...,9,49,10,,,2016,5/13/2016 15:51,,,
4,10534446,HZ277630,4/15/2016 10:00,055XX N KEDZIE AVE,890,THEFT,FROM BUILDING,"SCHOOL, PUBLIC, BUILDING",False,False,...,40,13,6,,,2016,5/25/2016 15:59,,,


## Filtering columns

To filter columns, use `Dataflow.drop_columns`. This method takes either a list of column names to drop or a `ColumnSelector`, which selects columns by pattern.

### Filtering columns with list of strings

In this example, `drop_columns` takes a list of strings. Each string must exactly match the name of a column to drop.

In [2]:
dflow = dflow.drop_columns(['ID', 'Location Description', 'Ward', 'Community Area', 'FBI Code'])
dflow.head(5)

Unnamed: 0,Case Number,Date,Block,IUCR,Primary Type,Description,Arrest,Domestic,Beat,District,X Coordinate,Y Coordinate,Year,Updated On,Latitude,Longitude,Location
0,HZ239907,4/15/2016 23:56,007XX E 111TH ST,1153,DECEPTIVE PRACTICE,FINANCIAL IDENTITY THEFT OVER $ 300,False,False,531,5,1183356.0,1831503.0,2016,5/11/2016 15:48,41.69283384,-87.60431945,"(41.692833841, -87.60431945)"
1,HZ258664,4/15/2016 17:00,082XX S MARSHFIELD AVE,890,THEFT,FROM BUILDING,False,False,614,6,1166776.0,1850053.0,2016,5/12/2016 15:48,41.74410697,-87.66449429,"(41.744106973, -87.664494285)"
2,HZ261252,4/15/2016 10:00,104XX S SACRAMENTO AVE,1154,DECEPTIVE PRACTICE,FINANCIAL IDENTITY THEFT $300 AND UNDER,False,False,2211,22,,,2016,5/12/2016 15:50,,,
3,HZ261534,4/15/2016 9:00,113XX S PRAIRIE AVE,1120,DECEPTIVE PRACTICE,FORGERY,False,False,531,5,,,2016,5/13/2016 15:51,,,
4,HZ277630,4/15/2016 10:00,055XX N KEDZIE AVE,890,THEFT,FROM BUILDING,False,False,1712,17,,,2016,5/25/2016 15:59,,,


### Filtering columns with regex

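Alternatively, a `ColumnSelector` can be used to drop columns that match a regular expression. In this example, we drop all the columns that match the expression `Column*|.*longitud|.*latitude`. The two `True` arguments appear to turn on regex matching and case-insensitive matching, which would explain why the lowercase pattern removes "Latitude" and "Longitude".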

In [3]:
dflow = dflow.drop_columns(dprep.ColumnSelector('Column*|.*longitud|.*latitude', True, True))
dflow.head(5)

Unnamed: 0,Case Number,Date,Block,IUCR,Primary Type,Description,Arrest,Domestic,Beat,District,X Coordinate,Y Coordinate,Year,Updated On,Location
0,HZ239907,4/15/2016 23:56,007XX E 111TH ST,1153,DECEPTIVE PRACTICE,FINANCIAL IDENTITY THEFT OVER $ 300,False,False,531,5,1183356.0,1831503.0,2016,5/11/2016 15:48,"(41.692833841, -87.60431945)"
1,HZ258664,4/15/2016 17:00,082XX S MARSHFIELD AVE,890,THEFT,FROM BUILDING,False,False,614,6,1166776.0,1850053.0,2016,5/12/2016 15:48,"(41.744106973, -87.664494285)"
2,HZ261252,4/15/2016 10:00,104XX S SACRAMENTO AVE,1154,DECEPTIVE PRACTICE,FINANCIAL IDENTITY THEFT $300 AND UNDER,False,False,2211,22,,,2016,5/12/2016 15:50,
3,HZ261534,4/15/2016 9:00,113XX S PRAIRIE AVE,1120,DECEPTIVE PRACTICE,FORGERY,False,False,531,5,,,2016,5/13/2016 15:51,
4,HZ277630,4/15/2016 10:00,055XX N KEDZIE AVE,890,THEFT,FROM BUILDING,False,False,1712,17,,,2016,5/25/2016 15:59,
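
To see why this pattern removes "Latitude" and "Longitude" but keeps "Location", the matching can be reproduced with Python's `re` module. This is only a sketch: it assumes that a column is selected when the pattern matches anywhere in its name, ignoring case, which is consistent with the output above but is not confirmed to be how `ColumnSelector` works internally.

```python
import re

# Column names as they appear before the regex-based drop_columns call.
columns = [
    'Case Number', 'Date', 'Block', 'IUCR', 'Primary Type', 'Description',
    'Arrest', 'Domestic', 'Beat', 'District', 'X Coordinate', 'Y Coordinate',
    'Year', 'Updated On', 'Latitude', 'Longitude', 'Location',
]

# Same pattern as in the example. Note that `Column*` means "Colum" followed
# by zero or more "n" characters, not "Column" followed by anything.
pattern = re.compile(r'Column*|.*longitud|.*latitude', re.IGNORECASE)

# Assumption: a name is selected if the pattern matches anywhere in it.
dropped = [name for name in columns if pattern.search(name)]
kept = [name for name in columns if not pattern.search(name)]

print(dropped)
print(kept)
```

Under these assumptions, only "Latitude" and "Longitude" are selected. No column name contains "Colum", so the first alternative has no effect on this dataset.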


## Filtering rows

To filter rows, use `Dataflow.filter`. This method takes an `Expression` as an argument and returns a new dataflow containing the rows for which the expression evaluates to `True`. Expressions are built by indexing the `Dataflow` with a column name (`dataflow['myColumn']`) and applying regular operators (`>`, `<`, `>=`, `<=`, `==`, `!=`).

### Filtering rows with simple expressions

Index into the Dataflow with the column name as a string (`dataflow['column_name']`), then combine the result with one of the standard operators `>, <, >=, <=, ==, !=` to build an expression such as `dataflow['District'] > 9`. Finally, pass the expression to `Dataflow.filter`.

In this example, `dataflow.filter(dataflow['District'] > 9)` returns a new dataflow with the rows in which the value of "District" is greater than 9.

*Note that "District" is first converted to numeric, which allows us to build an expression comparing it against other numeric values.*

In [4]:
dflow = dflow.to_number(['District'])
dflow = dflow.filter(dflow['District'] > 9)
dflow.head(5)

Unnamed: 0,Case Number,Date,Block,IUCR,Primary Type,Description,Arrest,Domestic,Beat,District,X Coordinate,Y Coordinate,Year,Updated On,Location
0,HZ261252,4/15/2016 10:00,104XX S SACRAMENTO AVE,1154,DECEPTIVE PRACTICE,FINANCIAL IDENTITY THEFT $300 AND UNDER,False,False,2211,22.0,,,2016,5/12/2016 15:50,
1,HZ277630,4/15/2016 10:00,055XX N KEDZIE AVE,890,THEFT,FROM BUILDING,False,False,1712,17.0,,,2016,5/25/2016 15:59,
2,HZ278872,4/15/2016 4:30,004XX S KILBOURN AVE,810,THEFT,OVER $500,False,False,1131,11.0,,,2016,5/25/2016 15:59,
3,HZ240778,4/15/2016 10:00,010XX N MILWAUKEE AVE,1152,DECEPTIVE PRACTICE,ILLEGAL USE CASH CARD,False,False,1213,12.0,,,2016,5/27/2016 15:45,
4,HZ264802,4/15/2016 16:00,019XX W DIVISION ST,1110,DECEPTIVE PRACTICE,BOGUS CHECK,False,False,1424,14.0,1163094.0,1908003.0,2016,5/16/2016 15:48,"(41.903206037, -87.676361925)"
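
The effect of the conversion and the filter above can be mimicked in plain Python. The sketch below is not Data Prep code: it uses the first five "Case Number" and "District" values from the earlier output, assumes "District" is text before conversion, and uses `float` and a list comprehension as stand-ins for `to_number` and `filter`.

```python
# First five rows from the earlier output; "District" is assumed to be text
# until it is converted.
rows = [
    {'Case Number': 'HZ239907', 'District': '5'},
    {'Case Number': 'HZ258664', 'District': '6'},
    {'Case Number': 'HZ261252', 'District': '22'},
    {'Case Number': 'HZ261534', 'District': '5'},
    {'Case Number': 'HZ277630', 'District': '17'},
]

# Stand-in for to_number: convert the text values to floats.
for row in rows:
    row['District'] = float(row['District'])

# Stand-in for filter: keep only the rows where the predicate is True.
kept = [row for row in rows if row['District'] > 9]

print([row['Case Number'] for row in kept])
```

The two surviving case numbers, HZ261252 and HZ277630, are the first two rows of the filtered output shown above.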


### Filtering rows with complex expressions

To filter using complex expressions, combine one or more simple expressions with the operators `&`, `|`, and `~`. Note that in Python these operators bind more tightly than the comparison operators, so each comparison must be wrapped in parentheses before it is combined.

In this example, `Dataflow.filter` returns a new dataflow with the rows in which "Primary Type" equals 'DECEPTIVE PRACTICE' and "District" is greater than or equal to 10.

In [5]:
dflow = dflow.to_number(['District'])
dflow = dflow.filter((dflow['Primary Type'] == 'DECEPTIVE PRACTICE') & (dflow['District'] >= 10))
dflow.head(5)

Unnamed: 0,Case Number,Date,Block,IUCR,Primary Type,Description,Arrest,Domestic,Beat,District,X Coordinate,Y Coordinate,Year,Updated On,Location
0,HZ261252,4/15/2016 10:00,104XX S SACRAMENTO AVE,1154,DECEPTIVE PRACTICE,FINANCIAL IDENTITY THEFT $300 AND UNDER,False,False,2211,22.0,,,2016,5/12/2016 15:50,
1,HZ240778,4/15/2016 10:00,010XX N MILWAUKEE AVE,1152,DECEPTIVE PRACTICE,ILLEGAL USE CASH CARD,False,False,1213,12.0,,,2016,5/27/2016 15:45,
2,HZ264802,4/15/2016 16:00,019XX W DIVISION ST,1110,DECEPTIVE PRACTICE,BOGUS CHECK,False,False,1424,14.0,1163094.0,1908003.0,2016,5/16/2016 15:48,"(41.903206037, -87.676361925)"
3,HZ265911,4/15/2016 8:00,061XX N SHERIDAN RD,1153,DECEPTIVE PRACTICE,FINANCIAL IDENTITY THEFT OVER $ 300,False,False,2433,24.0,,,2016,5/16/2016 15:50,
4,HZ268138,4/15/2016 15:00,023XX W EASTWOOD AVE,1153,DECEPTIVE PRACTICE,FINANCIAL IDENTITY THEFT OVER $ 300,False,False,1911,19.0,,,2016,5/18/2016 15:50,
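
The need for parentheses comes from Python's operator precedence and can be shown with plain integers, independent of Data Prep. In the sketch below, the variable names `a` and `b` are purely illustrative; the unparenthesized form is parsed as the chained comparison `a == (2 & b) >= 10`, which is not what was intended.

```python
a = 2
b = 10

# Intended meaning: both comparisons are evaluated first, then combined.
with_parens = (a == 2) & (b >= 10)

# Without parentheses, `&` is applied first: a == (2 & b) >= 10.
# 2 & 10 is 2, so this becomes (a == 2) and (2 >= 10).
without_parens = a == 2 & b >= 10

print(with_parens)
print(without_parens)
```

The two forms give different results for the same inputs, which is why every comparison in a combined filter expression should be parenthesized.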


Expressions can also be nested: a combined expression can itself be combined with further expressions.

*Note that `'Date'` and `'Updated On'` are first converted to datetime, which allows us to build expressions comparing them against other datetime values. The format string passed to `to_datetime` must match how the dates are written in the data.*

In [6]:
dflow = dflow.to_datetime(['Date', 'Updated On'], ['%Y-%m-%d %H:%M:%S'])
dflow = dflow.to_number(['District', 'Y Coordinate'])
comparison_date = datetime(2016,4,13)
dflow = dflow.filter(
    ((dflow['Date'] > comparison_date) | (dflow['Updated On'] > comparison_date))
    | ((dflow['Y Coordinate'] > 1900000) & (dflow['District'] > 10.0)))
dflow.head(5)

Unnamed: 0,Case Number,Date,Block,IUCR,Primary Type,Description,Arrest,Domestic,Beat,District,X Coordinate,Y Coordinate,Year,Updated On,Location
0,HZ264802,"azureml.dataprep.native.DataPrepError(""'Micros...",019XX W DIVISION ST,1110,DECEPTIVE PRACTICE,BOGUS CHECK,False,False,1424,14.0,1163094,1908003.0,2016,"azureml.dataprep.native.DataPrepError(""'Micros...","(41.903206037, -87.676361925)"
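
In the output above, "Date" and "Updated On" show `DataPrepError` values, which suggests that the conversion failed. A likely cause is that the format string `%Y-%m-%d %H:%M:%S` does not describe values such as `4/15/2016 16:00`. The sketch below uses Python's own `datetime.strptime`, not Data Prep's parser, to illustrate the mismatch. The replacement format `%m/%d/%Y %H:%M` is an assumption based on the sample rows and has not been verified against `to_datetime`.

```python
from datetime import datetime

raw_value = '4/15/2016 16:00'          # "Date" value of case HZ264802 in the CSV
original_format = '%Y-%m-%d %H:%M:%S'  # format string used in the cell above
candidate_format = '%m/%d/%Y %H:%M'    # assumption: month/day/year, no seconds


def try_parse(value, fmt):
    """Return the parsed datetime, or None if the value does not fit the format."""
    try:
        return datetime.strptime(value, fmt)
    except ValueError:
        return None


failed = try_parse(raw_value, original_format)
parsed = try_parse(raw_value, candidate_format)

print(failed)
print(parsed)

# With a successful parse, the date comparison from the filter can be evaluated.
comparison_date = datetime(2016, 4, 13)
print(parsed > comparison_date)
```

If the dates fail to convert, the date clauses of the filter cannot match, which is consistent with the single row above being the one that satisfies the "Y Coordinate" and "District" clause.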
