# Getting started with Azure ML Data Prep SDK

#### Note: Some features in this Notebook will _not_ work with the Private Preview version of the SDK; it assumes the Public Preview version.

Wondering how you can make the most of the Azure ML Data Prep SDK? In this "Getting Started" guide, we'll showcase a few highlights that make this SDK shine for big datasets where `pandas` and `dplyr` can fall short. Using the [Ford GoBike dataset](https://www.fordgobike.com/system-data) as an example, we'll cover how to build Dataflows that allow you to:

* [Read in data](#Read-in-data)
* [Get a profile of your data](#Get-data-profile)
* [Apply smart transforms by Microsoft Research](#Derive-by-example)
* [Filter quickly](#Filter-our-data)
* [Apply common data science transforms](#Transform-our-data)
* [Easily handle errors and assertions](#Assert-on-invalid-data)
* [Prepare your dataset for export and machine learning](#Export-for-machine-learning)

In [None]:
from IPython.display import display
from os import path
from tempfile import mkdtemp

import pandas as pd
import azureml.dataprep as dprep

## Read in data

Azure ML Data Prep can read many different file formats (e.g. CSV, Excel, Parquet), and it can also infer column types automatically.

In [None]:
gobike = dprep\
    .read_csv(
        path = 'https://dprepdata.blob.core.windows.net/demo/ford_gobike/2017-fordgobike-tripdata.csv',
        inference_arguments = dprep.InferenceArguments.current_culture()
    )
gobike.head(5)

To iterate more quickly, we can take a sample of our data. Later, we can apply the same transformations to the entire dataset.

In [None]:
sampled_gobike = gobike.take_sample(probability = 0.1, seed = 5)
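
As a rough illustration of what probability-based sampling with a seed means, here is a minimal plain-Python sketch. It assumes each record is kept independently with the given probability; the helper `take_sample_sketch` is hypothetical and is not how the SDK implements `take_sample`.

```python
import random


def take_sample_sketch(rows, probability, seed):
    """Keep each row independently with the given probability (hypothetical helper)."""
    rng = random.Random(seed)  # a dedicated generator makes the sample repeatable
    return [row for row in rows if rng.random() < probability]


rows = list(range(10000))  # stand-in for the records of a dataset
sample = take_sample_sketch(rows, probability=0.1, seed=5)
same_sample = take_sample_sketch(rows, probability=0.1, seed=5)

# The sample holds roughly 10% of the rows, and the same seed reproduces it exactly.
print(len(sample), sample == same_sample)
```

Fixing the seed is what lets us rerun the notebook and get the same sample each time.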

## Get data profile

Let's understand what our data looks like. Azure ML Data Prep facilitates this process by offering data profiles that give us a glimpse of column types and column summary statistics.

In [None]:
gobike.get_profile()

In [None]:
sampled_gobike.get_profile()

It appears that we have quite a few missing values in `member_birth_year`. We also immediately see that we have some empty strings in our `member_gender` column. With the data profiler, we can quickly do a sanity check on our dataset and see where we might need to start data cleaning.

## Derive by example

Azure ML Data Prep comes with additional "smart" transforms created by Microsoft Research. Here, we'll look at how you can derive a new column by providing examples of input-output pairs. Rather than explicitly using regular expressions to extract dates or hours from datetimes, we can provide examples for Azure ML Data Prep to learn what the pattern is. In fact, these smart transformations can also handle more complex derivations like inferring the day of the week from datetimes.

In [None]:
sgb_derived = sampled_gobike\
    .to_string(
        columns = ['start_time', 'end_time']
    )\
    .derive_column_by_example(
        source_columns = 'start_time',
        new_column_name = 'date',
        example_data = [('2017-12-31 16:57:39.6540', '2017-12-31'), ('2017-12-31 16:57:39', '2017-12-31')]
    )\
    .derive_column_by_example(
        source_columns = 'start_time',
        new_column_name = 'hour',
        example_data = [('2017-12-31 16:57:39.6540', '16')]
    )\
    .derive_column_by_example(
        source_columns = 'start_time',
        new_column_name = 'wday',
        example_data = [('2017-12-31 16:57:39.6540', 'Sunday')]
    )
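
For comparison, this is roughly what the explicit, hand-written alternative to the three derivations could look like with the standard library. It is only a sketch of the manual approach, not what `derive_column_by_example` does internally; the helper names and the `WEEKDAYS` list are ours.

```python
from datetime import datetime

# Index matches datetime.weekday(): Monday is 0, Sunday is 6.
WEEKDAYS = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']


def parse_timestamp(text):
    """Parse a timestamp with or without fractional seconds."""
    for fmt in ('%Y-%m-%d %H:%M:%S.%f', '%Y-%m-%d %H:%M:%S'):
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    raise ValueError('Unrecognized timestamp: {}'.format(text))


def derive_parts(text):
    """Return the (date, hour, wday) strings that the examples above describe."""
    ts = parse_timestamp(text)
    return ts.strftime('%Y-%m-%d'), ts.strftime('%H'), WEEKDAYS[ts.weekday()]


print(derive_parts('2017-12-31 16:57:39.6540'))
```

Every new timestamp layout needs another entry in the format list, which is the kind of bookkeeping that deriving by example is meant to spare us.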

## Filter our data

Let's verify that our derivations are correct by doing a bit of spot-checking.

In [None]:
sgb_derived.filter(dprep.col('wday') != 'Sunday').head(5)

We can also filter on other column types; let's take a peek at rides that lasted over 5 hours.

In [None]:
sgb_derived.filter(dprep.col('duration_sec') > (60 * 60 * 5)).head(5)

## Transform our data

In addition to "smart" transformations, Azure ML Data Prep also supports many common data science transforms familiar from other industry-standard data science libraries. Here, we'll explore the ability to `summarize` and to replace missing values with `replace_na`. We'll also get to use `join` when we handle assertions.

#### Summarize


In [None]:
sgb_summary = sgb_derived\
    .summarize(
        summary_columns = [
            dprep\
                .SummaryColumnsValue(
                    column_id = 'duration_sec', 
                    summary_column_name = 'duration_sec_mean', 
                    summary_function = dprep.SummaryFunction.MEAN
                )
        ],
        group_by_columns = ['date']
    )
sgb_summary.head(5)

Azure ML Data Prep also makes it easy to append the output of `summarize` to the original table, matched on the grouping column, by setting `join_back = True`.

In [None]:
sgb_appended = sgb_derived\
    .summarize(
        summary_columns = [
            dprep\
                .SummaryColumnsValue(
                    column_id = 'duration_sec', 
                    summary_column_name = 'duration_sec_mean', 
                    summary_function = dprep.SummaryFunction.MEAN
                )
        ],
        group_by_columns = ['date'],
        join_back = True
    )
sgb_appended.head(5)
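
To make the idea of summarizing and joining back concrete, here is a small plain-Python sketch: compute the mean per group, then attach each group's mean to every row in that group. The helper names and the three toy rides are made up for illustration, and this is not the SDK's implementation.

```python
from collections import defaultdict


def mean_by_group(rows, group_key, value_key):
    """Return {group: mean of value_key} over the rows."""
    totals = defaultdict(lambda: [0.0, 0])  # group -> [running sum, count]
    for row in rows:
        acc = totals[row[group_key]]
        acc[0] += row[value_key]
        acc[1] += 1
    return {group: total / count for group, (total, count) in totals.items()}


def append_group_mean(rows, group_key, value_key, new_key):
    """Copy each row and add its group's mean under new_key."""
    means = mean_by_group(rows, group_key, value_key)
    return [dict(row, **{new_key: means[row[group_key]]}) for row in rows]


# Toy data, not taken from the GoBike dataset.
rides = [
    {'date': '2017-12-30', 'duration_sec': 600},
    {'date': '2017-12-31', 'duration_sec': 300},
    {'date': '2017-12-31', 'duration_sec': 900},
]
appended = append_group_mean(rides, 'date', 'duration_sec', 'duration_sec_mean')
for row in appended:
    print(row)
```

The row count is unchanged; every row simply gains one extra column holding its group's summary value.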

#### Replace

Recall that our `member_gender` column had empty strings that stood in place of `None`. Let's use the `replace_na` function to properly recode them as `None`s.

In [None]:
sgb_replaced = sampled_gobike.replace_na(columns = ['member_gender'])
sgb_replaced.head(5)
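
Conceptually, the recoding amounts to something like the following plain-Python sketch. The helper `replace_empty_with_none` and the toy rows are hypothetical, and the sketch covers only the empty-string case described above.

```python
def replace_empty_with_none(rows, columns):
    """Return copies of the rows with empty strings in the given columns set to None."""
    cleaned = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows stay untouched
        for column in columns:
            if new_row.get(column) == '':
                new_row[column] = None
        cleaned.append(new_row)
    return cleaned


# Toy data, not taken from the GoBike dataset.
members = [
    {'bike_id': 1, 'member_gender': 'Female'},
    {'bike_id': 2, 'member_gender': ''},
]
print(replace_empty_with_none(members, columns=['member_gender']))
```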

## Assert on invalid data 

Azure ML Data Prep helps prevent broken pipelines and safeguards against bad data by supporting assertions. In our case, we'll create assertions to handle potentially erroneous `member_birth_year` values. The oldest person on record is no more than 130 years old, so any birth year listed as before 1900 must be wrong. Although our `sampled_gobike` dataset doesn't contain any such values, a pipeline built on that assumption would fail on the full `gobike` dataset. Azure ML Data Prep allows us to handle these cases gracefully with assertions.

In [None]:
gb_asserted = gobike\
    .assert_value(
        columns = 'member_birth_year', 
        expression = dprep.f_or(dprep.value.is_null(), dprep.value >= 1900),
        error_code = 'InvalidDate'
    )
gb_asserted.get_profile()

Now, we can filter to see what caused the 2 errors above:

In [None]:
gb_errors = gb_asserted.filter(dprep.col('member_birth_year').is_error())
gb_errors.head(5)
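
The pattern used here, marking bad values as errors instead of raising an exception, can be sketched in plain Python as follows. The `ErrorValue` type, the helper names, and the toy rows are hypothetical; we also assume the error marker replaces the original value rather than keeping it.

```python
from collections import namedtuple

# Marker that takes the place of a value that failed an assertion.
ErrorValue = namedtuple('ErrorValue', ['error_code'])


def assert_value_sketch(rows, column, predicate, error_code):
    """Copy the rows, replacing values that fail the predicate with an ErrorValue."""
    checked = []
    for row in rows:
        new_row = dict(row)
        if not predicate(new_row[column]):
            new_row[column] = ErrorValue(error_code)
        checked.append(new_row)
    return checked


def is_valid_birth_year(value):
    """Mirror the expression above: missing values pass, as do years from 1900 on."""
    return value is None or value >= 1900


# Toy data, not taken from the GoBike dataset.
riders = [
    {'bike_id': 1, 'member_birth_year': 1987},
    {'bike_id': 2, 'member_birth_year': None},
    {'bike_id': 3, 'member_birth_year': 1886},
]
asserted = assert_value_sketch(riders, 'member_birth_year', is_valid_birth_year, 'InvalidDate')
errors = [row for row in asserted if isinstance(row['member_birth_year'], ErrorValue)]
print(errors)
```

Because the pipeline keeps running, the failing rows can be inspected afterwards instead of halting the whole job at the first bad record.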

#### Join
But what were the original values? Let's use `join` to recover the values that caused our assertion to fail.

In [None]:
gb_errors.join(
    left_dataflow = gb_errors,
    right_dataflow = gobike,
    join_key_pairs = [
        ('duration_sec', 'duration_sec'),
        ('start_station_id', 'start_station_id'),
        ('bike_id', 'bike_id')
    ]
).head(5)

If we look at `r_member_birth_year`, we see that these people were listed as being born in 1886. That's impossible! Now that we've identified outliers and anomalies, we can appropriately clean our data however we like.
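
The join above matches rows on a composite key and prefixes the columns from each side. A minimal plain-Python sketch of that idea follows. The helper `inner_join` and the toy rows are hypothetical, and the `l_` prefix for left-hand columns is an assumption; only the `r_` prefix is visible in the output discussed above.

```python
def inner_join(left_rows, right_rows, key_pairs, left_prefix='l_', right_prefix='r_'):
    """Inner-join two lists of dicts on (left_column, right_column) key pairs."""
    # Index the right side by its composite key for fast lookup.
    index = {}
    for right in right_rows:
        key = tuple(right[right_col] for _, right_col in key_pairs)
        index.setdefault(key, []).append(right)

    joined = []
    for left in left_rows:
        key = tuple(left[left_col] for left_col, _ in key_pairs)
        for right in index.get(key, []):
            row = {left_prefix + name: value for name, value in left.items()}
            row.update({right_prefix + name: value for name, value in right.items()})
            joined.append(row)
    return joined


# Toy data, not taken from the GoBike dataset.
error_rows = [{'duration_sec': 1200, 'bike_id': 3, 'member_birth_year': 'InvalidDate'}]
original_rows = [
    {'duration_sec': 1200, 'bike_id': 3, 'member_birth_year': 1886},
    {'duration_sec': 450, 'bike_id': 7, 'member_birth_year': 1990},
]
keys = [('duration_sec', 'duration_sec'), ('bike_id', 'bike_id')]
print(inner_join(error_rows, original_rows, keys))
```

Joining on several columns at once matters here because no single column identifies a ride on its own.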

## Export for machine learning

One of the beautiful features of Azure ML Data Prep is that you only need to write your code once and choose whether to scale up or out; it takes care of figuring out how. To do so, you can export the `.dprep` file you've written and tested on a smaller dataset, then run it with your larger dataset. Here, we show how you can export your new package. For a more detailed example of how to execute it on Spark, check out our [New York Taxicab scenario](https://github.com/Microsoft/PendletonDocs/blob/master/Scenarios/NYTaxiCab/01.new_york_taxi.ipynb).

In [None]:
gobike = gobike.set_name(name = "gobike")
package_path = path.join(mkdtemp(), "gobike.dprep")

print("Saving package to: {}".format(package_path))
package = dprep.Package(arg = gobike)
package.save(file_path = package_path)

## Want more information?

Congratulations on finishing your introduction to the Azure ML Data Prep SDK! If you'd like more detailed tutorials on how to construct machine learning datasets or dive deeper into all of its functionality, you can find more information in our detailed notebooks [here](https://github.com/Microsoft/PendletonDocs). There, we cover topics including how to:

* Cache your Dataflow to speed up your iterations
* Add your custom Python transforms
* Impute missing values
* Sample your data
* Reference and link between Dataflows
* Apply your Dataflow to a new, larger data source