# Summarize
Copyright (c) Microsoft Corporation. All rights reserved.<br>
Licensed under the MIT License.<br>

Azure ML Data Prep can help summarize your data by providing you with a synopsis based on aggregates over specific columns.

## Table of Contents
[Overview](#Overview)<br>
[Summary Functions](#Summary-Functions)<br>
* [SummaryFunction.MIN](#SummaryFunction.MIN)<br>
* [SummaryFunction.MAX](#SummaryFunction.MAX)<br>
* [SummaryFunction.MEAN](#SummaryFunction.MEAN)<br>
* [SummaryFunction.MEDIAN](#SummaryFunction.MEDIAN)<br>
* [SummaryFunction.VAR](#SummaryFunction.VAR)<br>
* [SummaryFunction.SD](#SummaryFunction.SD)<br>
* [SummaryFunction.COUNT](#SummaryFunction.COUNT)<br>
* [SummaryFunction.SUM](#SummaryFunction.SUM)<br>
* [SummaryFunction.SKEWNESS](#SummaryFunction.SKEWNESS)<br>
* [SummaryFunction.KURTOSIS](#SummaryFunction.KURTOSIS)

## Overview
Before we drill down into each aggregate function, let us observe `summarize` end to end.

We will start by reading some data.

In [1]:
import azureml.dataprep as dprep
dflow = dprep.auto_read_file(path='../data/crime-dirty.csv')
dflow.head(10)

Unnamed: 0,ID,Case Number,Date,Block,IUCR,Primary Type,Description,Location Description,Arrest,Domestic,...,Ward,Community Area,FBI Code,X Coordinate,Y Coordinate,Year,Updated On,Latitude,Longitude,Location
0,10140490.0,HY329907,07/05/2015 11:50:00 PM,050XX N NEWLAND AVE,820.0,THEFT,$500 AND UNDER,STREET,False,False,...,41.0,10.0,06,1129230.0,1933315.0,2015.0,07/12/2015 12:42:46 PM,41.973309,-87.800175,"(41.973309466, -87.800174996)"
1,10139776.0,HY329265,07/05/2015 11:30:00 PM,011XX W MORSE AVE,460.0,BATTERY,SIMPLE,STREET,False,True,...,49.0,1.0,08B,1167370.0,1946271.0,2015.0,07/12/2015 12:42:46 PM,42.008124,-87.65955,"(42.008124017, -87.65955018)"
2,10140270.0,HY329253,07/05/2015 11:20:00 PM,121XX S FRONT AVE,486.0,BATTERY,DOMESTIC BATTERY SIMPLE,STREET,False,True,...,9.0,53.0,08B,,,2015.0,07/12/2015 12:42:46 PM,,,
3,10139885.0,HY329308,07/05/2015 11:19:00 PM,051XX W DIVISION ST,610.0,BURGLARY,FORCIBLE ENTRY,SMALL RETAIL STORE,False,False,...,37.0,25.0,05,1141721.0,1907465.0,2015.0,07/12/2015 12:42:46 PM,41.902152,-87.754883,"(41.902152027, -87.754883404)"
4,10140379.0,HY329556,07/05/2015 11:00:00 PM,012XX W LAKE ST,930.0,MOTOR VEHICLE THEFT,THEFT/RECOVERY: AUTOMOBILE,STREET,False,False,...,27.0,28.0,07,1168413.0,1901632.0,2015.0,07/12/2015 12:42:46 PM,41.88561,-87.657009,"(41.885610142, -87.657008701)"
5,10140868.0,HY330421,07/05/2015 10:54:00 PM,118XX S PEORIA ST,1320.0,CRIMINAL DAMAGE,TO VEHICLE,VEHICLE NON-COMMERCIAL,False,False,...,34.0,53.0,14,1172409.0,1826485.0,2015.0,07/12/2015 12:42:46 PM,41.679311,-87.644545,"(41.6793109, -87.644545209)"
6,10139762.0,HY329232,07/05/2015 10:42:00 PM,026XX W 37TH PL,1020.0,ARSON,BY FIRE,VACANT LOT/LAND,False,False,...,12.0,58.0,09,1159436.0,1879658.0,2015.0,07/12/2015 12:42:46 PM,41.825501,-87.690578,"(41.825500607, -87.690578042)"
7,10139722.0,HY329228,07/05/2015 10:30:00 PM,016XX S CENTRAL PARK AVE,1811.0,NARCOTICS,POSS: CANNABIS 30GMS OR LESS,ALLEY,True,False,...,24.0,29.0,18,1152687.0,1891389.0,2015.0,07/12/2015 12:42:46 PM,41.857828,-87.715029,"(41.857827814, -87.715028789)"
8,10139774.0,HY329209,07/05/2015 10:15:00 PM,048XX N ASHLAND AVE,1310.0,CRIMINAL DAMAGE,TO PROPERTY,APARTMENT,False,False,...,46.0,3.0,14,1164821.0,1932394.0,2015.0,07/12/2015 12:42:46 PM,41.9701,-87.669324,"(41.970099796, -87.669324377)"
9,10139697.0,HY329177,07/05/2015 10:10:00 PM,058XX S ARTESIAN AVE,1320.0,CRIMINAL DAMAGE,TO VEHICLE,ALLEY,False,False,...,16.0,63.0,14,1160997.0,1865851.0,2015.0,07/12/2015 12:42:46 PM,41.78758,-87.685233,"(41.787580282, -87.685233078)"


Next, we use `SummaryFunction.COUNT` to count the rows that have a non-null `ID` value, grouped by `Primary Type`.

In [2]:
dflow_summarize = dflow.summarize(
        summary_columns=[
            dprep.SummaryColumnsValue(
                column_id='ID',
                summary_column_name='Primary Type ID Counts', 
                summary_function=dprep.SummaryFunction.COUNT)],
        group_by_columns=['Primary Type'])
dflow_summarize.head(10)

Unnamed: 0,Primary Type,Primary Type ID Counts
0,THEFT,1
1,BATTERY,2
2,BURGLARY,1
3,MOTOR VEHICLE THEFT,1
4,CRIMINAL DAMAGE,3
5,ARSON,1
6,NARCOTICS,1
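
As a rough mental model of what this grouped count computes, the following plain-Python sketch tallies non-null `ID` values per `Primary Type` for the ten rows shown earlier. It only illustrates the semantics and is not how Data Prep is implemented; the `rows` list and the `id_counts` name are made up for this example.

```python
from collections import Counter

# (ID, Primary Type) pairs copied from the ten rows shown by dflow.head(10).
rows = [
    (10140490.0, "THEFT"),
    (10139776.0, "BATTERY"),
    (10140270.0, "BATTERY"),
    (10139885.0, "BURGLARY"),
    (10140379.0, "MOTOR VEHICLE THEFT"),
    (10140868.0, "CRIMINAL DAMAGE"),
    (10139762.0, "ARSON"),
    (10139722.0, "NARCOTICS"),
    (10139774.0, "CRIMINAL DAMAGE"),
    (10139697.0, "CRIMINAL DAMAGE"),
]

# Count the rows whose ID is not null, keyed by Primary Type.
id_counts = Counter(
    primary_type for row_id, primary_type in rows if row_id is not None
)
print(dict(id_counts))
```

The per-group totals should agree with the `Primary Type ID Counts` column above, for example 2 for BATTERY and 3 for CRIMINAL DAMAGE.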


If we do not group by anything, we instead get a single record summarizing the entire dataset. Here, that record holds the number of rows that have a non-null `ID` value.

In [3]:
dflow_summarize_nogroup = dflow.summarize(
        summary_columns=[
            dprep.SummaryColumnsValue(
                column_id='ID',
                summary_column_name='ID Count', 
                summary_function=dprep.SummaryFunction.COUNT)])
dflow_summarize_nogroup.head(1)

Unnamed: 0,ID Count
0,10


We can also group by multiple columns.

In [4]:
dflow_summarize_2group = dflow.summarize(
        summary_columns=[
            dprep.SummaryColumnsValue(
                column_id='ID',
                summary_column_name='Primary Type & Location Description ID Counts', 
                summary_function=dprep.SummaryFunction.COUNT)],
        group_by_columns=['Primary Type', 'Location Description'])
dflow_summarize_2group.head(10)

Unnamed: 0,Primary Type,Location Description,Primary Type & Location Description ID Counts
0,THEFT,STREET,1
1,BATTERY,STREET,2
2,BURGLARY,SMALL RETAIL STORE,1
3,MOTOR VEHICLE THEFT,STREET,1
4,CRIMINAL DAMAGE,VEHICLE NON-COMMERCIAL,1
5,ARSON,VACANT LOT/LAND,1
6,NARCOTICS,ALLEY,1
7,CRIMINAL DAMAGE,APARTMENT,1
8,CRIMINAL DAMAGE,ALLEY,1


Similarly, we can compute multiple aggregates in a single summary. Each aggregate function is independent, so the same column can be aggregated more than once.

In [5]:
dflow_summarize_multi_agg = dflow.summarize(
        summary_columns=[
            dprep.SummaryColumnsValue(
                column_id='ID',
                summary_column_name='Primary Type ID Counts', 
                summary_function=dprep.SummaryFunction.COUNT),
            dprep.SummaryColumnsValue(
                column_id='ID',
                summary_column_name='Primary Type Min ID', 
                summary_function=dprep.SummaryFunction.MIN),
            dprep.SummaryColumnsValue(
                column_id='Date',
                summary_column_name='Primary Type Max Date', 
                summary_function=dprep.SummaryFunction.MAX)],
        group_by_columns=['Primary Type'])
dflow_summarize_multi_agg.head(10)

Unnamed: 0,Primary Type,Primary Type ID Counts,Primary Type Min ID,Primary Type Max Date
0,THEFT,1,10140490.0,07/05/2015 11:50:00 PM
1,BATTERY,2,10139776.0,07/05/2015 11:30:00 PM
2,BURGLARY,1,10139885.0,07/05/2015 11:19:00 PM
3,MOTOR VEHICLE THEFT,1,10140379.0,07/05/2015 11:00:00 PM
4,CRIMINAL DAMAGE,3,10139697.0,07/05/2015 10:54:00 PM
5,ARSON,1,10139762.0,07/05/2015 10:42:00 PM
6,NARCOTICS,1,10139722.0,07/05/2015 10:30:00 PM
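
To make the idea of several independent aggregates per group concrete, here is a small plain-Python sketch over the BATTERY and CRIMINAL DAMAGE rows from the sample. It is an illustration only: the `rows`, `groups`, and `summary` names are hypothetical, and parsing `Date` with `datetime.strptime` is an assumption made for this sketch rather than a statement about how Data Prep types the column.

```python
from collections import defaultdict
from datetime import datetime

# (ID, Primary Type, Date) for the two multi-row groups in the sample.
rows = [
    (10139776.0, "BATTERY", "07/05/2015 11:30:00 PM"),
    (10140270.0, "BATTERY", "07/05/2015 11:20:00 PM"),
    (10140868.0, "CRIMINAL DAMAGE", "07/05/2015 10:54:00 PM"),
    (10139774.0, "CRIMINAL DAMAGE", "07/05/2015 10:15:00 PM"),
    (10139697.0, "CRIMINAL DAMAGE", "07/05/2015 10:10:00 PM"),
]

# Bucket the rows by Primary Type, parsing the date text along the way.
groups = defaultdict(list)
for row_id, primary_type, date_text in rows:
    parsed_date = datetime.strptime(date_text, "%m/%d/%Y %I:%M:%S %p")
    groups[primary_type].append((row_id, parsed_date))

# Each aggregate is computed independently over the same group members;
# the ID column is used twice (count and min), the Date column once (max).
summary = {
    primary_type: {
        "count": len(members),
        "min_id": min(row_id for row_id, _ in members),
        "max_date": max(date for _, date in members),
    }
    for primary_type, members in groups.items()
}
print(summary["CRIMINAL DAMAGE"])
```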


To bring this summary data back into the original dataset, we can use `join_back`, optionally with `join_back_columns_prefix` to make the new column names easy to distinguish. Summary columns are added at the end. `group_by_columns` is not required when using `join_back`; without it, however, the behavior is more like an append than a join.

In [6]:
dflow_summarize_join = dflow.summarize(
        summary_columns=[
            dprep.SummaryColumnsValue(
                column_id='ID',
                summary_column_name='Primary Type ID Counts', 
                summary_function=dprep.SummaryFunction.COUNT)],
        group_by_columns=['Primary Type'],
        join_back=True,
        join_back_columns_prefix='New_')
dflow_summarize_join.head(10)

Unnamed: 0,ID,Case Number,Date,Block,IUCR,Primary Type,Description,Location Description,Arrest,Domestic,...,Community Area,FBI Code,X Coordinate,Y Coordinate,Year,Updated On,Latitude,Longitude,Location,New_Primary Type ID Counts
0,10140490.0,HY329907,07/05/2015 11:50:00 PM,050XX N NEWLAND AVE,820.0,THEFT,$500 AND UNDER,STREET,False,False,...,10.0,06,1129230.0,1933315.0,2015.0,07/12/2015 12:42:46 PM,41.973309,-87.800175,"(41.973309466, -87.800174996)",1
1,10139776.0,HY329265,07/05/2015 11:30:00 PM,011XX W MORSE AVE,460.0,BATTERY,SIMPLE,STREET,False,True,...,1.0,08B,1167370.0,1946271.0,2015.0,07/12/2015 12:42:46 PM,42.008124,-87.65955,"(42.008124017, -87.65955018)",2
2,10140270.0,HY329253,07/05/2015 11:20:00 PM,121XX S FRONT AVE,486.0,BATTERY,DOMESTIC BATTERY SIMPLE,STREET,False,True,...,53.0,08B,,,2015.0,07/12/2015 12:42:46 PM,,,,2
3,10139885.0,HY329308,07/05/2015 11:19:00 PM,051XX W DIVISION ST,610.0,BURGLARY,FORCIBLE ENTRY,SMALL RETAIL STORE,False,False,...,25.0,05,1141721.0,1907465.0,2015.0,07/12/2015 12:42:46 PM,41.902152,-87.754883,"(41.902152027, -87.754883404)",1
4,10140379.0,HY329556,07/05/2015 11:00:00 PM,012XX W LAKE ST,930.0,MOTOR VEHICLE THEFT,THEFT/RECOVERY: AUTOMOBILE,STREET,False,False,...,28.0,07,1168413.0,1901632.0,2015.0,07/12/2015 12:42:46 PM,41.88561,-87.657009,"(41.885610142, -87.657008701)",1
5,10140868.0,HY330421,07/05/2015 10:54:00 PM,118XX S PEORIA ST,1320.0,CRIMINAL DAMAGE,TO VEHICLE,VEHICLE NON-COMMERCIAL,False,False,...,53.0,14,1172409.0,1826485.0,2015.0,07/12/2015 12:42:46 PM,41.679311,-87.644545,"(41.6793109, -87.644545209)",3
6,10139774.0,HY329209,07/05/2015 10:15:00 PM,048XX N ASHLAND AVE,1310.0,CRIMINAL DAMAGE,TO PROPERTY,APARTMENT,False,False,...,3.0,14,1164821.0,1932394.0,2015.0,07/12/2015 12:42:46 PM,41.9701,-87.669324,"(41.970099796, -87.669324377)",3
7,10139697.0,HY329177,07/05/2015 10:10:00 PM,058XX S ARTESIAN AVE,1320.0,CRIMINAL DAMAGE,TO VEHICLE,ALLEY,False,False,...,63.0,14,1160997.0,1865851.0,2015.0,07/12/2015 12:42:46 PM,41.78758,-87.685233,"(41.787580282, -87.685233078)",3
8,10139762.0,HY329232,07/05/2015 10:42:00 PM,026XX W 37TH PL,1020.0,ARSON,BY FIRE,VACANT LOT/LAND,False,False,...,58.0,09,1159436.0,1879658.0,2015.0,07/12/2015 12:42:46 PM,41.825501,-87.690578,"(41.825500607, -87.690578042)",1
9,10139722.0,HY329228,07/05/2015 10:30:00 PM,016XX S CENTRAL PARK AVE,1811.0,NARCOTICS,POSS: CANNABIS 30GMS OR LESS,ALLEY,True,False,...,29.0,18,1152687.0,1891389.0,2015.0,07/12/2015 12:42:46 PM,41.857828,-87.715029,"(41.857827814, -87.715028789)",1
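
Conceptually, `join_back` can be thought of as two steps: summarize per group, then attach each group's result to every row of that group. The plain-Python sketch below illustrates this under that assumption; it is not Data Prep's implementation, and the `primary_types`, `group_counts`, and `joined` names are invented for the example.

```python
from collections import Counter

# Primary Type of each of the ten sample rows, in their original order.
primary_types = [
    "THEFT", "BATTERY", "BATTERY", "BURGLARY", "MOTOR VEHICLE THEFT",
    "CRIMINAL DAMAGE", "ARSON", "NARCOTICS", "CRIMINAL DAMAGE", "CRIMINAL DAMAGE",
]

# The prefix is prepended to the summary column name.
prefix = "New_"
summary_name = prefix + "Primary Type ID Counts"

# Step 1: summarize, producing one count per group.
group_counts = Counter(primary_types)

# Step 2: join back, so every row picks up the count of its own group.
joined = [
    {"Primary Type": primary_type, summary_name: group_counts[primary_type]}
    for primary_type in primary_types
]
print(joined[1])
```

This sketch keeps the input row order, whereas the Data Prep output above lists some rows in a different order, so it is safer not to rely on row order after a `join_back`.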


## Summary Functions
Here we will go over all the possible aggregates in Data Prep.
The most up-to-date set of functions can be found by enumerating the `SummaryFunction` enum.

In [7]:
import azureml.dataprep as dprep
[x.name for x in dprep.SummaryFunction]

['MIN',
 'MAX',
 'MEAN',
 'MEDIAN',
 'VAR',
 'SD',
 'COUNT',
 'SUM',
 'SKEWNESS',
 'KURTOSIS']

### SummaryFunction.MIN
Data Prep can aggregate and find the minimum value of a column.

In [8]:
import azureml.dataprep as dprep
dflow = dprep.auto_read_file(path='../data/crime-dirty.csv')
dflow_min = dflow.summarize(
        summary_columns=[
            dprep.SummaryColumnsValue(
                column_id='Date',
                summary_column_name='Primary Type Min Date', 
                summary_function=dprep.SummaryFunction.MIN)],
        group_by_columns=['Primary Type'])
dflow_min.head(10)

Unnamed: 0,Primary Type,Primary Type Min Date
0,THEFT,07/05/2015 11:50:00 PM
1,BATTERY,07/05/2015 11:20:00 PM
2,BURGLARY,07/05/2015 11:19:00 PM
3,MOTOR VEHICLE THEFT,07/05/2015 11:00:00 PM
4,CRIMINAL DAMAGE,07/05/2015 10:10:00 PM
5,ARSON,07/05/2015 10:42:00 PM
6,NARCOTICS,07/05/2015 10:30:00 PM


### SummaryFunction.MAX
Data Prep can find the maximum value of a column.

In [9]:
import azureml.dataprep as dprep
dflow = dprep.auto_read_file(path='../data/crime-dirty.csv')
dflow_max = dflow.summarize(
        summary_columns=[
            dprep.SummaryColumnsValue(
                column_id='Date',
                summary_column_name='Primary Type Max Date',
                summary_function=dprep.SummaryFunction.MAX)],
        group_by_columns=['Primary Type'])
dflow_max.head(10)

Unnamed: 0,Primary Type,Primary Type Max Date
0,THEFT,07/05/2015 11:50:00 PM
1,BATTERY,07/05/2015 11:30:00 PM
2,BURGLARY,07/05/2015 11:19:00 PM
3,MOTOR VEHICLE THEFT,07/05/2015 11:00:00 PM
4,CRIMINAL DAMAGE,07/05/2015 10:54:00 PM
5,ARSON,07/05/2015 10:42:00 PM
6,NARCOTICS,07/05/2015 10:30:00 PM


### SummaryFunction.MEAN
Data Prep can find the statistical mean of a column.

In [10]:
import azureml.dataprep as dprep
dflow = dprep.auto_read_file(path='../data/crime-dirty.csv')
dflow_mean = dflow.summarize(
        summary_columns=[
            dprep.SummaryColumnsValue(
                column_id='Latitude',
                summary_column_name='Primary Type Latitude Mean',
                summary_function=dprep.SummaryFunction.MEAN)],
        group_by_columns=['Primary Type'])
dflow_mean.head(10)

Unnamed: 0,Primary Type,Primary Type Latitude Mean
0,THEFT,41.973309
1,BATTERY,42.008124
2,BURGLARY,41.902152
3,MOTOR VEHICLE THEFT,41.88561
4,CRIMINAL DAMAGE,41.81233
5,ARSON,41.825501
6,NARCOTICS,41.857828


### SummaryFunction.MEDIAN
Data Prep can find the median value of a column.

In [11]:
import azureml.dataprep as dprep
dflow = dprep.auto_read_file(path='../data/crime-dirty.csv')
dflow_median = dflow.summarize(
        summary_columns=[
            dprep.SummaryColumnsValue(
                column_id='Latitude',
                summary_column_name='Primary Type Latitude Median',
                summary_function=dprep.SummaryFunction.MEDIAN)],
        group_by_columns=['Primary Type'])
dflow_median.head(10)

Unnamed: 0,Primary Type,Primary Type Latitude Median
0,THEFT,41.973309
1,BATTERY,42.008124
2,BURGLARY,41.902152
3,MOTOR VEHICLE THEFT,41.88561
4,CRIMINAL DAMAGE,41.78758
5,ARSON,41.825501
6,NARCOTICS,41.857828
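
CRIMINAL DAMAGE is the only group with more than one latitude, so it is the only place where the mean and the median differ. The following sketch recomputes both with Python's standard `statistics` module from the three latitudes displayed in the sample rows. The displayed values are rounded, so small differences in the last digits are possible; the variable names are illustrative.

```python
import statistics

# Latitudes of the three CRIMINAL DAMAGE rows, as displayed in the sample.
criminal_damage_latitudes = [41.679311, 41.9701, 41.78758]

# Mean: sum of the values divided by their count.
mean_latitude = statistics.mean(criminal_damage_latitudes)

# Median: the middle value once the data is sorted.
median_latitude = statistics.median(criminal_damage_latitudes)

print(round(mean_latitude, 6), median_latitude)
```

For groups with a single latitude, both functions simply return that value, which is why the MEAN and MEDIAN outputs are identical everywhere except CRIMINAL DAMAGE.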


### SummaryFunction.VAR
Data Prep can find the statistical variance of a column. This requires more than one data point per group; a group with only a single value yields an empty result.

In [12]:
import azureml.dataprep as dprep
dflow = dprep.auto_read_file(path='../data/crime-dirty.csv')
dflow_var = dflow.summarize(
        summary_columns=[
            dprep.SummaryColumnsValue(
                column_id='Latitude',
                summary_column_name='Primary Type Latitude Variance',
                summary_function=dprep.SummaryFunction.VAR)],
        group_by_columns=['Primary Type'])
dflow_var.head(10)

Unnamed: 0,Primary Type,Primary Type Latitude Variance
0,THEFT,
1,BATTERY,
2,BURGLARY,
3,MOTOR VEHICLE THEFT,
4,CRIMINAL DAMAGE,0.021599
5,ARSON,
6,NARCOTICS,


Note that although there are two BATTERY cases, one of them is missing its geographical location, so only CRIMINAL DAMAGE yields a variance.
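
The CRIMINAL DAMAGE value can be cross-checked by hand. The sketch below uses Python's standard `statistics` module on the three displayed latitudes; the reported 0.021599 is consistent with the sample variance (dividing by n - 1) rather than the population variance (dividing by n). This is an inference from the output shown here, not a statement taken from the Data Prep documentation.

```python
import statistics

# Latitudes of the three CRIMINAL DAMAGE rows, as displayed in the sample.
criminal_damage_latitudes = [41.679311, 41.9701, 41.78758]

# Sample variance divides the squared deviations by n - 1.
sample_variance = statistics.variance(criminal_damage_latitudes)

# Population variance divides by n and gives a noticeably smaller number.
population_variance = statistics.pvariance(criminal_damage_latitudes)

print(round(sample_variance, 6), round(population_variance, 6))

# With a single data point the sample variance is undefined, which lines up
# with the empty cells for the one-latitude groups.
try:
    statistics.variance([41.973309])
    single_point_has_variance = True
except statistics.StatisticsError:
    single_point_has_variance = False
```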

### SummaryFunction.SD
Data Prep can find the standard deviation of a column. This requires more than one data point per group; a group with only a single value yields an empty result.

In [13]:
import azureml.dataprep as dprep
dflow = dprep.auto_read_file(path='../data/crime-dirty.csv')
dflow_sd = dflow.summarize(
        summary_columns=[
            dprep.SummaryColumnsValue(
                column_id='Latitude',
                summary_column_name='Primary Type Latitude Standard Deviation',
                summary_function=dprep.SummaryFunction.SD)],
        group_by_columns=['Primary Type'])
dflow_sd.head(10)

Unnamed: 0,Primary Type,Primary Type Latitude Standard Deviation
0,THEFT,
1,BATTERY,
2,BURGLARY,
3,MOTOR VEHICLE THEFT,
4,CRIMINAL DAMAGE,0.146966
5,ARSON,
6,NARCOTICS,


As with variance, although there are two BATTERY cases, one of them is missing its geographical location, so only CRIMINAL DAMAGE yields a standard deviation.
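
As a quick sanity check, the standard deviation is the square root of the variance. The sketch below recomputes it from the three displayed CRIMINAL DAMAGE latitudes with the standard `statistics` module, assuming the sample (n - 1) form, which is what the reported 0.146966 appears to match.

```python
import math
import statistics

# Latitudes of the three CRIMINAL DAMAGE rows, as displayed in the sample.
criminal_damage_latitudes = [41.679311, 41.9701, 41.78758]

# Sample standard deviation, computed directly...
sample_sd = statistics.stdev(criminal_damage_latitudes)

# ...and as the square root of the sample variance.
sd_from_variance = math.sqrt(statistics.variance(criminal_damage_latitudes))

print(round(sample_sd, 6))
```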

### SummaryFunction.COUNT
Data Prep can count the number of rows that have a column with non-null values.

In [14]:
import azureml.dataprep as dprep
dflow = dprep.auto_read_file(path='../data/crime-dirty.csv')
dflow_count = dflow.summarize(
        summary_columns=[
            dprep.SummaryColumnsValue(
                column_id='Latitude',
                summary_column_name='Primary Type Latitude Count',
                summary_function=dprep.SummaryFunction.COUNT)],
        group_by_columns=['Primary Type'])
dflow_count.head(10)

Unnamed: 0,Primary Type,Primary Type Latitude Count
0,THEFT,1
1,BATTERY,1
2,BURGLARY,1
3,MOTOR VEHICLE THEFT,1
4,CRIMINAL DAMAGE,3
5,ARSON,1
6,NARCOTICS,1


Note that although there are two BATTERY cases, one of them is missing its geographical location, so when we group by Primary Type we get a Latitude count of only one.
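
The difference between counting rows and counting non-null values can be illustrated in a few lines of plain Python. Here `None` stands in for a missing value and `count_non_null` is a hypothetical helper written for this sketch, not a Data Prep function.

```python
# The two BATTERY rows from the sample: both have an ID, only one has a Latitude.
battery_ids = [10139776.0, 10140270.0]
battery_latitudes = [42.008124, None]


def count_non_null(values):
    """Count the entries that are not None, mirroring a null-skipping COUNT."""
    return sum(1 for value in values if value is not None)


id_count = count_non_null(battery_ids)
latitude_count = count_non_null(battery_latitudes)
print(id_count, latitude_count)
```

The same group therefore reports 2 when counting `ID` and 1 when counting `Latitude`, so the column chosen for `COUNT` matters whenever the data contains nulls.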

### SummaryFunction.SUM
Data Prep can aggregate and sum the values of a column. Our dataset does not have many numerical facts, but here we sum IDs grouped by Primary Type.

In [15]:
import azureml.dataprep as dprep
dflow = dprep.auto_read_file(path='../data/crime-dirty.csv')
dflow_sum = dflow.summarize(
        summary_columns=[
            dprep.SummaryColumnsValue(
                column_id='ID',
                summary_column_name='Primary Type ID Sum',
                summary_function=dprep.SummaryFunction.SUM)],
        group_by_columns=['Primary Type'])
dflow_sum.head(10)

Unnamed: 0,Primary Type,Primary Type ID Sum
0,THEFT,10140490.0
1,BATTERY,20280046.0
2,BURGLARY,10139885.0
3,MOTOR VEHICLE THEFT,10140379.0
4,CRIMINAL DAMAGE,30420339.0
5,ARSON,10139762.0
6,NARCOTICS,10139722.0


### SummaryFunction.SKEWNESS
Data Prep can calculate the skewness of data in a column. This requires more than one data point per group; a group with only a single value yields an empty result.

In [16]:
import azureml.dataprep as dprep
dflow = dprep.auto_read_file(path='../data/crime-dirty.csv')
dflow_skewness = dflow.summarize(
        summary_columns=[
            dprep.SummaryColumnsValue(
                column_id='Latitude',
                summary_column_name='Primary Type Latitude Skewness',
                summary_function=dprep.SummaryFunction.SKEWNESS)],
        group_by_columns=['Primary Type'])
dflow_skewness.head(10)

Unnamed: 0,Primary Type,Primary Type Latitude Skewness
0,THEFT,
1,BATTERY,
2,BURGLARY,
3,MOTOR VEHICLE THEFT,
4,CRIMINAL DAMAGE,0.163631
5,ARSON,
6,NARCOTICS,
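
Skewness has several common definitions that differ in their small-sample corrections. The CRIMINAL DAMAGE value above appears to match the third central moment divided by the cube of the sample standard deviation. The sketch below implements that variant in plain Python; the choice of formula is inferred from the output rather than taken from the Data Prep documentation, and the `skewness` helper is hypothetical.

```python
import statistics


def skewness(values):
    """Third central moment (divided by n) over the cubed sample standard deviation."""
    n = len(values)
    mean = statistics.mean(values)
    third_moment = sum((value - mean) ** 3 for value in values) / n
    return third_moment / statistics.stdev(values) ** 3


# Latitudes of the three CRIMINAL DAMAGE rows, as displayed in the sample.
criminal_damage_latitudes = [41.679311, 41.9701, 41.78758]
latitude_skewness = skewness(criminal_damage_latitudes)
print(latitude_skewness)
```

Because the displayed latitudes are rounded, the result should land close to the reported 0.163631 without necessarily matching every digit. A positive value indicates a longer tail toward larger latitudes.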


### SummaryFunction.KURTOSIS
Data Prep can calculate the kurtosis of data in a column. This requires more than one data point per group; a group with only a single value yields an empty result.

In [17]:
import azureml.dataprep as dprep
dflow = dprep.auto_read_file(path='../data/crime-dirty.csv')
dflow_kurtosis = dflow.summarize(
        summary_columns=[
            dprep.SummaryColumnsValue(
                column_id='Latitude',
                summary_column_name='Primary Type Latitude Kurtosis',
                summary_function=dprep.SummaryFunction.KURTOSIS)],
        group_by_columns=['Primary Type'])
dflow_kurtosis.head(10)

Unnamed: 0,Primary Type,Primary Type Latitude Kurtosis
0,THEFT,
1,BATTERY,
2,BURGLARY,
3,MOTOR VEHICLE THEFT,
4,CRIMINAL DAMAGE,-2.333333
5,ARSON,
6,NARCOTICS,
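
The CRIMINAL DAMAGE value of -2.333333 is consistent with excess kurtosis computed as the fourth central moment divided by the squared sample variance, minus 3. The plain-Python sketch below uses that formula; it is inferred from the output, not taken from the Data Prep documentation, and `excess_kurtosis` is a hypothetical helper.

```python
def excess_kurtosis(values):
    """Fourth central moment (divided by n) over the squared sample variance, minus 3."""
    n = len(values)
    mean = sum(values) / n
    fourth_moment = sum((value - mean) ** 4 for value in values) / n
    sample_variance = sum((value - mean) ** 2 for value in values) / (n - 1)
    return fourth_moment / sample_variance ** 2 - 3


# Latitudes of the three CRIMINAL DAMAGE rows, as displayed in the sample.
criminal_damage_latitudes = [41.679311, 41.9701, 41.78758]
latitude_kurtosis = excess_kurtosis(criminal_damage_latitudes)

# Any other three distinct values give the same result under this formula.
other_kurtosis = excess_kurtosis([1.0, 2.0, 10.0])
print(latitude_kurtosis, other_kurtosis)
```

Under this formula, any group of exactly three non-identical values yields -7/3, so the figure above reflects the group size rather than the shape of the latitude distribution. Kurtosis becomes informative only for larger groups.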
