# Data Profile
Copyright (c) Microsoft Corporation. All rights reserved.<br>
Licensed under the MIT License.

A DataProfile collects summary statistics on the data produced by a Dataflow. `Dataflow.get_profile()` executes the Dataflow and returns a newly constructed DataProfile.

In [1]:
import azureml.dataprep as dprep

df = dprep.smart_read_file('data/crime0-10.csv')
profile = df.get_profile()
profile

| | Type | Min | Max | Count | Missing Count | Error Count | Lower Quartile | Upper Quartile | Standard Deviation | Mean |
|---|---|---|---|---|---|---|---|---|---|---|
| Location | FieldType.STRING | | (42.008124017, -87.65955018) | 10.0 | 0.0 | 0.0 | | | | |
| X Coordinate | FieldType.DECIMAL | 1.12923e+06 | 1.17241e+06 | 10.0 | 1.0 | 0.0 | 1149950.0 | 1167630.0 | 14052.0 | 1157450.0 |
| Description | FieldType.STRING | $500 AND UNDER | TO VEHICLE | 10.0 | 0.0 | 0.0 | | | | |
| IUCR | FieldType.DECIMAL | 460 | 1811 | 10.0 | 0.0 | 0.0 | 610.0 | 1320.0 | 435.056 | 1008.7 |
| Y Coordinate | FieldType.DECIMAL | 1.82648e+06 | 1.94627e+06 | 10.0 | 1.0 | 0.0 | 1876210.0 | 1932620.0 | 37733.2 | 1898270.0 |
| Primary Type | FieldType.STRING | ARSON | THEFT | 10.0 | 0.0 | 0.0 | | | | |
| Latitude | FieldType.DECIMAL | 41.6793 | 42.0081 | 10.0 | 1.0 | 0.0 | 41.816 | 41.9709 | 0.103645 | 41.8766 |
| FBI Code | FieldType.STRING | 05 | 18 | 10.0 | 0.0 | 0.0 | | | | |
| Community Area | FieldType.DECIMAL | 1 | 63 | 10.0 | 0.0 | 0.0 | 10.0 | 53.0 | 23.2811 | 32.3 |
| Ward | FieldType.DECIMAL | 9 | 49 | 10.0 | 0.0 | 0.0 | 16.0 | 41.0 | 14.1676 | 29.5 |


In [2]:
print(profile)

ColumnProfile
    name: Location
    type: FieldType.STRING

    min: 
    max: (42.008124017, -87.65955018)
    count: 10.0
    missing_count: 0.0
    error_count: 0.0

ColumnProfile
    name: X Coordinate
    type: FieldType.DECIMAL

    min: 1129230.0
    max: 1172409.0
    count: 10.0
    missing_count: 1.0
    error_count: 0.0

    lower_quartile: 1149945.5
    median: 1160997.0
    upper_quartile: 1167630.75
    std: 14052.038898125928
    mean: 1157453.7777777778

ColumnProfile
    name: Description
    type: FieldType.STRING

    min: $500 AND UNDER
    max: TO VEHICLE
    count: 10.0
    missing_count: 0.0
    error_count: 0.0

ColumnProfile
    name: IUCR
    type: FieldType.DECIMAL

    min: 460.0
    max: 1811.0
    count: 10.0
    missing_count: 0.0
    error_count: 0.0

    lower_quartile: 610.0
    median: 975.0
    upper_quartile: 1320.0
    std: 435.0555647781608
    mean: 1008.7

ColumnProfile
    name: Y Coordinate
    type: FieldType.DECIMAL

    min: 1826485.0
    

A DataProfile contains a collection of ColumnProfiles, indexed by column name. Each ColumnProfile has attributes for the calculated column statistics.

In [3]:
profile.columns['ID']

| | Statistics |
|---|---|
| Type | FieldType.DECIMAL |
| Min | 1.01397e+07 |
| Max | 1.01409e+07 |
| Count | 10 |
| Missing Count | 0 |
| Error Count | 0 |
| Lower Quartile | 1.01398e+07 |
| Upper Quartile | 1.01404e+07 |
| Standard Deviation | 409.806 |
| Mean | 1.01401e+07 |


We can also extract a specific attribute across all columns by using a dict comprehension.

In [4]:
column_types = { c.name: c.type for c in profile.columns.values() }
column_types

{'Arrest': <FieldType.BOOLEAN: 1>,
 'Beat': <FieldType.DECIMAL: 3>,
 'Block': <FieldType.STRING: 0>,
 'Case Number': <FieldType.STRING: 0>,
 'Community Area': <FieldType.DECIMAL: 3>,
 'Date': <FieldType.DATE: 4>,
 'Description': <FieldType.STRING: 0>,
 'District': <FieldType.DECIMAL: 3>,
 'Domestic': <FieldType.BOOLEAN: 1>,
 'FBI Code': <FieldType.STRING: 0>,
 'ID': <FieldType.DECIMAL: 3>,
 'IUCR': <FieldType.DECIMAL: 3>,
 'Latitude': <FieldType.DECIMAL: 3>,
 'Location': <FieldType.STRING: 0>,
 'Location Description': <FieldType.STRING: 0>,
 'Longitude': <FieldType.DECIMAL: 3>,
 'Primary Type': <FieldType.STRING: 0>,
 'Updated On': <FieldType.DATE: 4>,
 'Ward': <FieldType.DECIMAL: 3>,
 'X Coordinate': <FieldType.DECIMAL: 3>,
 'Y Coordinate': <FieldType.DECIMAL: 3>,
 'Year': <FieldType.DECIMAL: 3>}
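
The same comprehension pattern should extend to other attributes. As a minimal sketch, the code below computes the fraction of missing values per column from `count` and `missing_count`, the attribute names that appear in the printed profile. So that it runs without the library, it uses a hypothetical `FakeColumnProfile` stand-in filled with two rows from the output above; with a real profile, `profile.columns.values()` would presumably take the place of `columns.values()`. It also assumes that `count` includes missing values, which this document does not state explicitly.

```python
from collections import namedtuple

# Hypothetical stand-in for a ColumnProfile; it carries only the
# attributes this sketch needs and is not part of azureml.dataprep.
FakeColumnProfile = namedtuple(
    "FakeColumnProfile", ["name", "count", "missing_count"]
)

# Values copied from the profile output shown earlier.
columns = {
    "X Coordinate": FakeColumnProfile("X Coordinate", 10.0, 1.0),
    "IUCR": FakeColumnProfile("IUCR", 10.0, 0.0),
}


def missing_ratio(column_profiles):
    """Map each column name to missing_count / count (0.0 if count is 0)."""
    return {
        c.name: (c.missing_count / c.count if c.count else 0.0)
        for c in column_profiles
    }


ratios = missing_ratio(columns.values())
print(ratios)
```

A dictionary like this can be filtered to flag columns whose missing ratio exceeds a chosen threshold before further processing.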

A ColumnProfile may also contain a summary of the three most common values with their respective counts. (This is only available if the column has fewer than 1,000 unique values.)

In [5]:
profile.columns['Primary Type'].value_counts

[ValueCountEntry(value='CRIMINAL DAMAGE', count=3),
 ValueCountEntry(value='BATTERY', count=2),
 ValueCountEntry(value='NARCOTICS', count=1)]
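
Because each entry exposes `value` and `count`, the list can be reshaped into a plain dictionary for lookups. The sketch below illustrates this with a hypothetical `ValueCountEntry` namedtuple that mirrors the repr shown above; it is a stand-in, not the library's class, and the helper name `counts_by_value` is an assumption made for illustration.

```python
from collections import namedtuple

# Hypothetical stand-in that mirrors the ValueCountEntry repr above.
ValueCountEntry = namedtuple("ValueCountEntry", ["value", "count"])

# Entries copied from the output above.
value_counts = [
    ValueCountEntry(value="CRIMINAL DAMAGE", count=3),
    ValueCountEntry(value="BATTERY", count=2),
    ValueCountEntry(value="NARCOTICS", count=1),
]


def counts_by_value(entries):
    """Turn a list of value-count entries into a {value: count} dict."""
    return {e.value: e.count for e in entries}


lookup = counts_by_value(value_counts)
# Pick the entry with the highest count without relying on list order.
most_common = max(value_counts, key=lambda e: e.count)
print(lookup["BATTERY"], most_common.value)
```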

Numeric ColumnProfiles include an estimated histogram of the data.

In [6]:
profile.columns['District'].histogram

[HistogramBucket(lower_bound=5.0, upper_bound=6.9, count=1.1333333333333335),
 HistogramBucket(lower_bound=6.9, upper_bound=8.8, count=1.1666666666666672),
 HistogramBucket(lower_bound=8.8, upper_bound=10.7, count=1.549999999999999),
 HistogramBucket(lower_bound=10.7, upper_bound=12.6, count=0.8499999999999996),
 HistogramBucket(lower_bound=12.6, upper_bound=14.5, count=0.6333333333333337),
 HistogramBucket(lower_bound=14.5, upper_bound=16.4, count=1.2666666666666666),
 HistogramBucket(lower_bound=16.4, upper_bound=18.299999999999997, count=0.47499999999999964),
 HistogramBucket(lower_bound=18.299999999999997, upper_bound=20.2, count=0.4750000000000014),
 HistogramBucket(lower_bound=20.2, upper_bound=22.099999999999998, count=0.47499999999999787),
 HistogramBucket(lower_bound=22.099999999999998, upper_bound=24.0, count=0.9750000000000014)]
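
Since the histogram is an estimate, bucket counts can be fractional. As a rough sketch, the code below sums the estimated counts and finds the bucket with the largest one. It uses a hypothetical `HistogramBucket` namedtuple that mirrors the repr above so that it runs without the library; the helper names `total_count` and `densest_bucket` are assumptions made for illustration.

```python
from collections import namedtuple

# Hypothetical stand-in that mirrors the HistogramBucket repr above.
HistogramBucket = namedtuple(
    "HistogramBucket", ["lower_bound", "upper_bound", "count"]
)

# Buckets copied from the output above.
histogram = [
    HistogramBucket(5.0, 6.9, 1.1333333333333335),
    HistogramBucket(6.9, 8.8, 1.1666666666666672),
    HistogramBucket(8.8, 10.7, 1.549999999999999),
    HistogramBucket(10.7, 12.6, 0.8499999999999996),
    HistogramBucket(12.6, 14.5, 0.6333333333333337),
    HistogramBucket(14.5, 16.4, 1.2666666666666666),
    HistogramBucket(16.4, 18.299999999999997, 0.47499999999999964),
    HistogramBucket(18.299999999999997, 20.2, 0.4750000000000014),
    HistogramBucket(20.2, 22.099999999999998, 0.47499999999999787),
    HistogramBucket(22.099999999999998, 24.0, 0.9750000000000014),
]


def total_count(buckets):
    """Sum the estimated counts across all buckets."""
    return sum(b.count for b in buckets)


def densest_bucket(buckets):
    """Return the bucket with the largest estimated count."""
    return max(buckets, key=lambda b: b.count)


total = total_count(histogram)
peak = densest_bucket(histogram)
print(round(total, 6), (peak.lower_bound, peak.upper_bound))
```

For the buckets shown, the estimated counts sum to roughly 9, and the densest bucket spans 8.8 to 10.7.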