# Data Profile

A DataProfile collects summary statistics on the data produced by a Dataflow. `Dataflow.get_profile()` executes the Dataflow and returns a newly constructed DataProfile.

In [1]:
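import azureml.dataprep as dprep

# Load the sample file into a Dataflow.
df = dprep.smart_read_file('data/crime0-10.csv')
# Execute the Dataflow and compute summary statistics for each column.
profile = df.get_profile()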

A DataProfile contains a collection of ColumnProfiles, indexed by column name. Each ColumnProfile has attributes for the calculated column statistics.

In [2]:
profile.columns['ID']

ColumnProfile
    name: ID
    type: FieldType.DECIMAL

    min: 10139697.0
    max: 10140868.0
    count: 10.0
    missing_count: 0.0
    error_count: 0.0

    lower_quartile: 10139762.0
    median: 10139830.5
    upper_quartile: 10140379.0
    std: 409.8056585475928
    mean: 10140062.299999999
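
Because the statistics are exposed as attributes, derived measures can be computed directly from a ColumnProfile. The sketch below is illustrative only: it uses a `SimpleNamespace` stand-in so that it runs without `azureml.dataprep`, and it assumes the attribute names match the labels in the printed output above (`min`, `max`, `lower_quartile`, `upper_quartile`). With the real library, `id_profile` would be `profile.columns['ID']`.

```python
from types import SimpleNamespace

# Hypothetical stand-in for profile.columns['ID'].
# Values are copied from the ColumnProfile output shown above.
id_profile = SimpleNamespace(
    name='ID',
    min=10139697.0,
    max=10140868.0,
    lower_quartile=10139762.0,
    upper_quartile=10140379.0,
)

# Full spread of the column.
value_range = id_profile.max - id_profile.min

# Interquartile range: spread of the middle 50% of the values.
iqr = id_profile.upper_quartile - id_profile.lower_quartile

print(value_range)  # 1171.0
print(iqr)          # 617.0
```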

To extract a single attribute from every column, use a dict comprehension over the ColumnProfiles.

In [3]:
column_types = { c.name: c.type for c in profile.columns.values() }
column_types

{'Arrest': <FieldType.BOOLEAN: 1>,
 'Beat': <FieldType.DECIMAL: 3>,
 'Block': <FieldType.STRING: 0>,
 'Case Number': <FieldType.STRING: 0>,
 'Community Area': <FieldType.DECIMAL: 3>,
 'Date': <FieldType.DATE: 4>,
 'Description': <FieldType.STRING: 0>,
 'District': <FieldType.DECIMAL: 3>,
 'Domestic': <FieldType.BOOLEAN: 1>,
 'FBI Code': <FieldType.STRING: 0>,
 'ID': <FieldType.DECIMAL: 3>,
 'IUCR': <FieldType.DECIMAL: 3>,
 'Latitude': <FieldType.DECIMAL: 3>,
 'Location': <FieldType.STRING: 0>,
 'Location Description': <FieldType.STRING: 0>,
 'Longitude': <FieldType.DECIMAL: 3>,
 'Primary Type': <FieldType.STRING: 0>,
 'Updated On': <FieldType.DATE: 4>,
 'Ward': <FieldType.DECIMAL: 3>,
 'X Coordinate': <FieldType.DECIMAL: 3>,
 'Y Coordinate': <FieldType.DECIMAL: 3>,
 'Year': <FieldType.DECIMAL: 3>}
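
A mapping like this is convenient for selecting columns by type, for example to find every numeric column. The following is a minimal sketch under stated assumptions: the `FieldType` enum here is a local stand-in whose members and values are copied from the output above (it is not imported from `azureml.dataprep`), and `column_types` is a hand-typed subset of that output so the example runs on its own.

```python
from collections import defaultdict
from enum import Enum


class FieldType(Enum):
    """Local stand-in mirroring the enum values shown in the output above."""
    STRING = 0
    BOOLEAN = 1
    DECIMAL = 3
    DATE = 4


# Subset of the column_types dict from the previous cell, typed by hand.
column_types = {
    'Arrest': FieldType.BOOLEAN,
    'Beat': FieldType.DECIMAL,
    'Block': FieldType.STRING,
    'Date': FieldType.DATE,
    'ID': FieldType.DECIMAL,
}

# Invert the mapping: field type -> list of column names.
columns_by_type = defaultdict(list)
for name, field_type in column_types.items():
    columns_by_type[field_type].append(name)

decimal_columns = columns_by_type[FieldType.DECIMAL]
print(decimal_columns)  # ['Beat', 'ID']
```

With the real profile, the same loop should work unchanged on the `column_types` dict built above, since it only relies on the values being hashable.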