# AML Data Prep SDK Demos

This notebook shows some of the ways the Data Prep SDK works, where it makes data preparation easier, and how it augments libraries like Pandas.

In [29]:
import pandas as pd
import azureml.dataprep as dprep
from azureml.dataprep import f_and, value

## Automatic File Reading

Working out all the parameters needed to parse a file can be very time consuming. The Data Prep API tries to use ML to determine the file type and its parameters under the covers, which keeps the API call simple and somewhat robust to changes over time.

In [30]:
dflow = dprep.auto_read_file('deepdata.txt')
df = dflow.head(5)


## Lazy Eval and Streaming

Dataflows are not like DataFrames: they are lazily evaluated. A Dataflow describes the steps to run, and those steps are executed only when results are requested. Calling `head` is one such request: it runs the flow for the first few records and returns them as a pandas DataFrame, so the `df` displayed below is an ordinary DataFrame.

In addition, the data is never completely loaded into memory. It is always streamed, so a Dataflow can handle datasets much larger than memory.

In [31]:
df

Unnamed: 0,Name,CompanyName,SalesPerson,EmailAddress,Founded,Last Order,Sales to Date,City,postal_code,latitude,longitude
0,Mr. Seth Juarez,A Bike Store,adventure-works\pamela0,orlando0@adventure-works.com,21-Feb-73,531.0,95962473.0,San Francisco,94122.0,37.758941,-222.48591
1,Ms Katherine Harding,Vintage Sport Boutique,adventure-works\david8,kendra0@adventure-works.com,5 November 1880,68343.0,342244200.0,San Francisco,94122.0,37.758941,-122.48591
2,Mrs Kami LeMonds,Trendy Department Stores,adventure-works\shu0,donald1@adventure-works.com,8-Oct-43,83287.0,92839201.0,San Francisco,94122.0,37.758941,-122.48591
3,Mr. Andrew Cencini,Sports Merchandise,adventure-works\pamela0,andrew2@adventure-works.com,1915,58533.0,43569020.0,SJ,94115.0,37.782632,-122.432504
4,Mr. Darren Gehring,Journey Sporting Goods,adventure-works\jillian0,darren0@adventure-works.com,28-Jul-32,65744.0,38783980.0,San Antonio,94133.0,47.609722,-122.333056
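
The lazy, streamed behaviour described above can be sketched with plain Python generators. This is only an analogy and says nothing about how the Data Prep engine is implemented; `SAMPLE`, `read_rows` and `log` are names invented for the illustration.

```python
import io
from itertools import islice

# Invented sample in the same pipe-delimited shape as the demo file.
SAMPLE = "Name|City\nSeth|San Francisco\nKami|SJ\nAndrew|SA\nDarren|S.D.\n"

log = []  # records which data lines were actually read


def read_rows(stream):
    """Yield one row at a time; nothing is read until a consumer asks."""
    header = next(stream).rstrip("\n").split("|")
    for line in stream:
        log.append(line)
        yield dict(zip(header, line.rstrip("\n").split("|")))


rows = read_rows(io.StringIO(SAMPLE))  # describes the work, reads nothing yet
lines_read_before = len(log)

first_two = list(islice(rows, 2))      # like head(2): pulls only two rows
print(first_two)
print(lines_read_before, len(log))
```

Only the rows that were asked for are ever read, which is why the same pattern scales to inputs larger than memory.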


Sometimes it's hard to trust "magical" methods. Anywhere the Data Prep API uses one, it is also possible to get at the underlying methods, which provide more transparency about what is going on and usually allow the settings to be overridden.

In [53]:
ff = dprep.detect_file_format('deepdata.txt')
print(ff.file_format)

ParseDelimitedProperties
    separator: '|'
    headers_mode: PromoteHeadersMode.CONSTANTGROUPED
    encoding: FileEncoding.UTF8
    quoting: False
    skip_rows: 3
    skip_mode: SkipMode.GROUPED
    comment: None
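
As a rough stand-in for the idea of detecting parse settings from the file contents, the standard library's `csv.Sniffer` can guess a delimiter from a sample. This is not the detection logic Data Prep uses, and the sample text below is invented.

```python
import csv

# Invented sample; the real file is not reproduced here.
sample = (
    "Name|City|postal_code\n"
    "Seth Juarez|San Francisco|94122\n"
    "Kami LeMonds|SJ|94115\n"
)

# Restrict the guess to a few plausible separators.
dialect = csv.Sniffer().sniff(sample, delimiters="|,;\t")
print(repr(dialect.delimiter))

# The detected dialect can then be inspected or overridden before parsing.
rows = list(csv.reader(sample.splitlines(), dialect))
print(rows[0])
```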



## Type System

When `head` is called, the Dataflow returns a pandas DataFrame, so its `dtypes` are the usual pandas types. When `dtypes` is read from the Dataflow itself, it returns the native Data Prep field types.

In [33]:
dflow.head(1).dtypes

Name              object
CompanyName       object
SalesPerson       object
EmailAddress      object
Founded           object
Last Order       float64
Sales to Date    float64
City              object
postal_code      float64
latitude         float64
longitude        float64
dtype: object

In [34]:
dflow.dtypes

Name                     FieldType.STRING
CompanyName              FieldType.STRING
SalesPerson              FieldType.STRING
EmailAddress             FieldType.STRING
Founded                  FieldType.STRING
Last Order               FieldType.DECIMAL
Sales to Date            FieldType.DECIMAL
City                     FieldType.STRING
postal_code              FieldType.DECIMAL
latitude                 FieldType.DECIMAL
longitude                FieldType.DECIMAL

As we saw in the previous notebooks, getting profile information on our data is key to data prep, and Data Prep has a specific API for it.

The profile gives a statistical overview similar to pandas_profiling, but it is not opinionated about what is good or bad.

The API also lets you look at specific columns and drill into them for histograms and frequency tables.

In [35]:
profile = dflow.get_profile()
profile

Unnamed: 0,Type,Min,Max,Count,Missing Count,Not Missing Count,Percent missing,Error Count,Empty count,0.1% Quantile,1% Quantile,5% Quantile,25% Quantile,50% Quantile,75% Quantile,95% Quantile,99% Quantile,99.9% Quantile,Mean,Standard Deviation,Variance,Skewness,Kurtosis
Name,FieldType.STRING,Mr. Alexander J. Deborde,Ms. Yuhong Li,66.0,0.0,66.0,0.0,0.0,0.0,,,,,,,,,,,,,,
CompanyName,FieldType.STRING,A Bike Store,Wholesale Parts,66.0,0.0,66.0,0.0,0.0,0.0,,,,,,,,,,,,,,
SalesPerson,FieldType.STRING,adventure-works\david8,adventure-works\shu0,66.0,0.0,66.0,0.0,0.0,0.0,,,,,,,,,,,,,,
EmailAddress,FieldType.STRING,aidan0@adventure-works.com,yuhong1@adventure-works.com,66.0,0.0,66.0,0.0,0.0,0.0,,,,,,,,,,,,,,
Founded,FieldType.STRING,10-May-35,9-Jul-01,66.0,0.0,66.0,0.0,0.0,0.0,,,,,,,,,,,,,,
Last Order,FieldType.DECIMAL,138,99489,66.0,0.0,66.0,0.0,0.0,0.0,138.0,17432.1,17362.0,36954.0,58385.5,73814.0,95511.0,99373.2,99489.0,55340.5,26393.2,696603000.0,-0.273553,-0.899927
Sales to Date,FieldType.DECIMAL,2.31478e+06,3.42244e+08,66.0,0.0,66.0,0.0,0.0,0.0,2314780.0,9271950.0,9263630.0,30020300.0,56007400.0,85201400.0,109087000.0,332982000.0,342244000.0,63223500.0,55489100.0,3079040000000000.0,2.76122,10.8135
City,FieldType.STRING,S.A.,San Jose,66.0,0.0,66.0,0.0,0.0,0.0,,,,,,,,,,,,,,
postal_code,FieldType.DECIMAL,94103,94133,66.0,0.0,66.0,0.0,0.0,0.0,94103.0,94107.1,94107.0,94115.0,94122.0,94133.0,94133.0,94133.0,94133.0,94122.6,10.3493,107.107,-0.395359,-1.29098
latitude,FieldType.DECIMAL,37.7129,47.6097,66.0,0.0,66.0,0.0,0.0,0.0,37.7129,37.7438,37.7427,37.7589,37.7845,37.7984,47.6097,47.6097,47.6097,38.3725,2.36435,5.59014,3.59906,11.124
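
To make a few of the profile columns concrete, here is a minimal sketch that computes the same kinds of statistics with the standard library. It uses only the five `Last Order` values from the five-row preview earlier in this notebook, so the numbers will not match the 66-row profile above.

```python
import statistics

# `Last Order` values from the five-row preview shown earlier.
last_order = [531.0, 68343.0, 83287.0, 58533.0, 65744.0]

summary = {
    "Count": len(last_order),
    "Min": min(last_order),
    "Max": max(last_order),
    "Mean": statistics.mean(last_order),
    "50% Quantile": statistics.median(last_order),
    "Standard Deviation": statistics.stdev(last_order),  # sample std dev
}

for name, value in summary.items():
    print(f"{name}: {value:,.1f}")
```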


In [36]:
profile.columns['Sales to Date'].histogram

[HistogramBucket(lower_bound=2314778.0, upper_bound=36307720.2, count=19.9435657265208),
 HistogramBucket(lower_bound=36307720.2, upper_bound=70300662.4, count=18.826865814121565),
 HistogramBucket(lower_bound=70300662.4, upper_bound=104293604.60000001, count=23.830304954852217),
 HistogramBucket(lower_bound=104293604.60000001, upper_bound=138286546.8, count=0.7039129284339296),
 HistogramBucket(lower_bound=138286546.8, upper_bound=172279489.0, count=0.3750903312441025),
 HistogramBucket(lower_bound=172279489.0, upper_bound=206272431.20000002, count=0.24878195727981023),
 HistogramBucket(lower_bound=206272431.20000002, upper_bound=240265373.40000004, count=0.24878195727980312),
 HistogramBucket(lower_bound=240265373.40000004, upper_bound=274258315.6, count=0.24878195727980312),
 HistogramBucket(lower_bound=274258315.6, upper_bound=308251257.8, count=0.4866790927209763),
 HistogramBucket(lower_bound=308251257.8, upper_bound=342244200.0, count=1.0872352802669951)]
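
The bucket bounds above are consistent with ten equal-width buckets between the column minimum and maximum. The sketch below recomputes those edges and counts a handful of values into them. It uses exact integer counts, whereas the counts in the output above are fractional, so it should be read as an illustration of the bucketing idea only.

```python
def equal_width_edges(lower, upper, buckets):
    """Return the boundaries for `buckets` equal-width bins."""
    width = (upper - lower) / buckets
    return [lower + i * width for i in range(buckets + 1)]


def bucket_counts(values, edges):
    """Count values per bucket; the maximum falls in the last bucket."""
    counts = [0] * (len(edges) - 1)
    width = edges[1] - edges[0]
    for v in values:
        index = min(int((v - edges[0]) / width), len(counts) - 1)
        counts[index] += 1
    return counts


# Minimum and maximum taken from the histogram output above.
edges = equal_width_edges(2314778.0, 342244200.0, 10)
print(edges[:3])

# `Sales to Date` values from the five-row preview.
sales = [95962473.0, 342244200.0, 92839201.0, 43569020.0, 38783980.0]
print(bucket_counts(sales, edges))
```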

In [37]:
profile.columns['City'].value_counts

[ValueCountEntry(value='San Francisco', count=15.0),
 ValueCountEntry(value='San Antonio', count=14.0),
 ValueCountEntry(value='SA', count=6.0),
 ValueCountEntry(value='San Jose', count=6.0),
 ValueCountEntry(value='SAN FRANCISCO', count=5.0),
 ValueCountEntry(value='S.A.', count=4.0),
 ValueCountEntry(value='S.J.', count=4.0),
 ValueCountEntry(value='SJ', count=3.0),
 ValueCountEntry(value='SAN JOSE', count=3.0),
 ValueCountEntry(value='S.D.', count=3.0),
 ValueCountEntry(value='San Diego', count=3.0)]

It's clear from this frequency table that the same city is listed multiple times in different forms. This is a very common problem, and solving it involves fuzzy grouping logic. The Data Prep API has this built in, using tech from Microsoft Research that is also used in Power BI and SQL Server Integration Services.

In [38]:
dflow2 = dflow.fuzzy_group_column(source_column='City',
                                  new_column_name='Clean City',
                                  similarity_threshold=0.8)


In [41]:
dflow2.head(10)

Unnamed: 0,Name,CompanyName,SalesPerson,EmailAddress,Founded,Last Order,Sales to Date,City,Clean City,postal_code,latitude,longitude
0,Mr. Seth Juarez,A Bike Store,adventure-works\pamela0,orlando0@adventure-works.com,21-Feb-73,531.0,95962473.0,San Francisco,San Francisco,94122.0,37.758941,-222.48591
1,Ms Katherine Harding,Vintage Sport Boutique,adventure-works\david8,kendra0@adventure-works.com,5 November 1880,68343.0,342244200.0,San Francisco,San Francisco,94122.0,37.758941,-122.48591
2,Mrs Kami LeMonds,Trendy Department Stores,adventure-works\shu0,donald1@adventure-works.com,8-Oct-43,83287.0,92839201.0,San Francisco,San Francisco,94122.0,37.758941,-122.48591
3,Mr. Andrew Cencini,Sports Merchandise,adventure-works\pamela0,andrew2@adventure-works.com,1915,58533.0,43569020.0,SJ,San Jose,94115.0,37.782632,-122.432504
4,Mr. Darren Gehring,Journey Sporting Goods,adventure-works\jillian0,darren0@adventure-works.com,28-Jul-32,65744.0,38783980.0,San Antonio,San Antonio,94133.0,47.609722,-122.333056
5,Ms. Rebecca Laszlo,Instruments and Parts Company,adventure-works\jae0,rebecca2@adventure-works.com,8-Oct-43,18491.0,82452139.0,S.A.,San Antonio,94133.0,37.797817,-122.408597
6,Mr. Daniel P. Thompson,Travel Sports,adventure-works\pamela0,daniel2@adventure-works.com,10-Oct-50,49186.0,40685181.0,San Francisco,San Francisco,94122.0,37.758941,-122.48591
7,Mr. Paulo H. Lisboa,Elite Bikes,adventure-works\jillian0,paulo0@adventure-works.com,11 August 1897,87506.0,284357790.0,SAN JOSE,San Jose,94103.0,37.771437,-122.423892
8,Ms. Aidan Delaney,Paint Supply,adventure-works\jillian0,aidan0@adventure-works.com,10-May-35,24681.0,147720300.0,SA,San Antonio,94133.0,37.797398,-122.405322
9,Ms. Hattie J. Haemon,Greater Bike Store,adventure-works\jose1,hattie0@adventure-works.com,1928,79859.0,99428898.0,S.A.,San Antonio,94133.0,37.797468,-122.406147


In [54]:
profile = dflow2.get_profile()
profile.columns['City'].value_counts

[ValueCountEntry(value='San Francisco', count=15.0),
 ValueCountEntry(value='San Antonio', count=14.0),
 ValueCountEntry(value='SA', count=6.0),
 ValueCountEntry(value='San Jose', count=6.0),
 ValueCountEntry(value='SAN FRANCISCO', count=5.0),
 ValueCountEntry(value='S.A.', count=4.0),
 ValueCountEntry(value='S.J.', count=4.0),
 ValueCountEntry(value='SJ', count=3.0),
 ValueCountEntry(value='SAN JOSE', count=3.0),
 ValueCountEntry(value='S.D.', count=3.0),
 ValueCountEntry(value='San Diego', count=3.0)]

In [55]:
profile.columns['Clean City'].value_counts

[ValueCountEntry(value='San Antonio', count=24.0),
 ValueCountEntry(value='San Francisco', count=20.0),
 ValueCountEntry(value='San Jose', count=16.0),
 ValueCountEntry(value='San Diego', count=6.0)]
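
To see what the grouping has to achieve, here is a deliberately simple rule-based sketch that reproduces the `Clean City` counts from the `City` counts. It is not the similarity-based algorithm behind `fuzzy_group_column`: it relies on an assumed list of canonical names (`CANONICAL`) and on matching initials, both invented for this illustration.

```python
from collections import Counter

# Distinct `City` values and their counts, copied from the frequency table.
city_counts = {
    "San Francisco": 15, "San Antonio": 14, "SA": 6, "San Jose": 6,
    "SAN FRANCISCO": 5, "S.A.": 4, "S.J.": 4, "SJ": 3,
    "SAN JOSE": 3, "S.D.": 3, "San Diego": 3,
}

# Assumed canonical names; real fuzzy grouping needs no such list.
CANONICAL = ["San Antonio", "San Francisco", "San Jose", "San Diego"]


def normalise(text):
    """Drop dots and case so 'S.A.' and 'SA' compare equal."""
    return text.replace(".", "").casefold()


def initials(name):
    return "".join(word[0] for word in name.split()).casefold()


def clean_city(raw):
    key = normalise(raw)
    for name in CANONICAL:
        if key in (normalise(name), initials(name)):
            return name
    return raw  # leave unrecognised values untouched


clean_counts = Counter()
for raw, count in city_counts.items():
    clean_counts[clean_city(raw)] += count

print(clean_counts.most_common())
```

A hand-written rule like this breaks as soon as a new spelling or a typo appears, which is the gap similarity-based grouping is meant to close.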

Comparing the `City` and `Clean City` columns shows that the fuzzy operation, which added the new column, has reduced the number of distinct values from 11 to 4.

Because fuzzy grouping is one of the "magic" methods, there are more detailed methods that let you dig in under the covers.

Now we need to take the new `Clean City` column and turn it into an integer, as we did for the Titanic dataset. The Data Prep API has a helper method for label encoding, which behaves like a Pandas `map`.

In [56]:
dflow3 = dflow2.label_encode(source_column='Clean City', new_column_name='City Label')
dflow3.head(10)

Unnamed: 0,Name,CompanyName,SalesPerson,EmailAddress,Founded,Last Order,Sales to Date,City,Clean City,City Label,postal_code,latitude,longitude
0,Mr. Seth Juarez,A Bike Store,adventure-works\pamela0,orlando0@adventure-works.com,21-Feb-73,531.0,95962473.0,San Francisco,San Francisco,1,94122.0,37.758941,-222.48591
1,Ms Katherine Harding,Vintage Sport Boutique,adventure-works\david8,kendra0@adventure-works.com,5 November 1880,68343.0,342244200.0,San Francisco,San Francisco,1,94122.0,37.758941,-122.48591
2,Mrs Kami LeMonds,Trendy Department Stores,adventure-works\shu0,donald1@adventure-works.com,8-Oct-43,83287.0,92839201.0,San Francisco,San Francisco,1,94122.0,37.758941,-122.48591
3,Mr. Andrew Cencini,Sports Merchandise,adventure-works\pamela0,andrew2@adventure-works.com,1915,58533.0,43569020.0,SJ,San Jose,2,94115.0,37.782632,-122.432504
4,Mr. Darren Gehring,Journey Sporting Goods,adventure-works\jillian0,darren0@adventure-works.com,28-Jul-32,65744.0,38783980.0,San Antonio,San Antonio,0,94133.0,47.609722,-122.333056
5,Ms. Rebecca Laszlo,Instruments and Parts Company,adventure-works\jae0,rebecca2@adventure-works.com,8-Oct-43,18491.0,82452139.0,S.A.,San Antonio,0,94133.0,37.797817,-122.408597
6,Mr. Daniel P. Thompson,Travel Sports,adventure-works\pamela0,daniel2@adventure-works.com,10-Oct-50,49186.0,40685181.0,San Francisco,San Francisco,1,94122.0,37.758941,-122.48591
7,Mr. Paulo H. Lisboa,Elite Bikes,adventure-works\jillian0,paulo0@adventure-works.com,11 August 1897,87506.0,284357790.0,SAN JOSE,San Jose,2,94103.0,37.771437,-122.423892
8,Ms. Aidan Delaney,Paint Supply,adventure-works\jillian0,aidan0@adventure-works.com,10-May-35,24681.0,147720300.0,SA,San Antonio,0,94133.0,37.797398,-122.405322
9,Ms. Hattie J. Haemon,Greater Bike Store,adventure-works\jose1,hattie0@adventure-works.com,1928,79859.0,99428898.0,S.A.,San Antonio,0,94133.0,37.797468,-122.406147
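
Label encoding amounts to building a value-to-integer dictionary and mapping it over the column. The sketch below does that in plain Python. The ordering rule is a guess: assigning labels by descending frequency happens to reproduce the 0, 1 and 2 shown above, but nothing here confirms how `label_encode` actually orders its labels.

```python
# Counts of each `Clean City` value, copied from the frequency table above.
clean_city_counts = {
    "San Antonio": 24,
    "San Francisco": 20,
    "San Jose": 16,
    "San Diego": 6,
}

# Assumption: most frequent value gets 0, ties broken alphabetically.
ordered = sorted(clean_city_counts, key=lambda c: (-clean_city_counts[c], c))
labels = {city: index for index, city in enumerate(ordered)}
print(labels)

# Mapping the dictionary over a column is the whole encoding step.
column = ["San Francisco", "San Jose", "San Antonio"]
print([labels[city] for city in column])
```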


## Derived Column by Example using Program Synthesis

Remember all the code that we had to write to extract the titles from the names in the Titanic dataset? There is a "magic" capability that uses a form of ML to train a model from the examples provided and then derives the new column using that model.

In [63]:
df_source = dflow3.head(100)

In [64]:
builder = dflow3.builders.derive_column_by_example(source_columns=['Name'], new_column_name='Salutation')
builder.add_example(source_data=df_source.iloc[0], example_value='Mr')
builder.preview()

Unnamed: 0,Name,Salutation
0,Mr. Seth Juarez,Mr
1,Ms Katherine Harding,Ms
2,Mrs Kami LeMonds,Mrs
3,Mr. Andrew Cencini,Mr
4,Mr. Darren Gehring,Mr
5,Ms. Rebecca Laszlo,Ms
6,Mr. Daniel P. Thompson,Mr
7,Mr. Paulo H. Lisboa,Mr
8,Ms. Aidan Delaney,Ms
9,Ms. Hattie J. Haemon,Ms


OK, that's less code but not super magical. What if we wanted to extract the first name instead, or as well? That would be a bunch more, or different, code, right? Well, we know the model is trained from examples of the intended output, so let's provide different examples.

In [65]:
builder = dflow3.builders.derive_column_by_example(source_columns=['Name'], new_column_name='First Name')
builder.add_example(source_data=df_source.iloc[0], example_value='Seth')
builder.preview()

Unnamed: 0,Name,First Name
0,Mr. Seth Juarez,Seth
1,Ms Katherine Harding,Katherine
2,Mrs Kami LeMonds,Kami
3,Mr. Andrew Cencini,Andrew
4,Mr. Darren Gehring,Darren
5,Ms. Rebecca Laszlo,Rebecca
6,Mr. Daniel P. Thompson,Daniel P.
7,Mr. Paulo H. Lisboa,Paulo H.
8,Ms. Aidan Delaney,Aidan
9,Ms. Hattie J. Haemon,Hattie J.


OK, that's pretty good, but it's clearly not perfect. How do we fix it? It gets Daniel, Paulo and others wrong because it has no example of what we want it to do with a middle initial. So let's give it an example of how to handle one.

In [66]:
builder = dflow3.builders.derive_column_by_example(source_columns=['Name'], new_column_name='First Name')
builder.add_example(source_data=df_source.iloc[0], example_value='Seth')
builder.add_example(source_data=df_source.iloc[6], example_value='Daniel')
builder.preview()

Unnamed: 0,Name,First Name
0,Mr. Seth Juarez,Seth
1,Ms Katherine Harding,Katherine
2,Mrs Kami LeMonds,Kami
3,Mr. Andrew Cencini,Andrew
4,Mr. Darren Gehring,Darren
5,Ms. Rebecca Laszlo,Rebecca
6,Mr. Daniel P. Thompson,Daniel
7,Mr. Paulo H. Lisboa,Paulo
8,Ms. Aidan Delaney,Aidan
9,Ms. Hattie J. Haemon,Hattie
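
The way an extra example narrows down the result can be illustrated with a toy "search by example": try a fixed list of candidate string programs and keep the first one that reproduces every example. This is a sketch of the general idea only. The three candidates in `CANDIDATES` are hand-written assumptions, and the real feature searches a far richer space in a way this does not attempt to model.

```python
from typing import Callable, List, Optional, Tuple

Program = Callable[[str], str]

# A tiny, hand-written search space (an assumption for illustration).
CANDIDATES: List[Tuple[str, Program]] = [
    ("tokens between first and last", lambda s: " ".join(s.split()[1:-1])),
    ("second token", lambda s: s.split()[1]),
    ("first token without trailing dot", lambda s: s.split()[0].rstrip(".")),
]


def synthesize(examples: List[Tuple[str, str]]) -> Optional[Tuple[str, Program]]:
    """Return the first candidate that reproduces every example."""
    for description, program in CANDIDATES:
        if all(program(source) == wanted for source, wanted in examples):
            return description, program
    return None


names = ["Mr. Seth Juarez", "Ms Katherine Harding", "Mr. Daniel P. Thompson"]

one = synthesize([("Mr. Seth Juarez", "Seth")])
two = synthesize([("Mr. Seth Juarez", "Seth"),
                  ("Mr. Daniel P. Thompson", "Daniel")])

for description, program in (one, two):
    print(description, [program(n) for n in names])
```

With one example, more than one candidate fits and the first happens to keep the middle initial. The second example rules that candidate out, which mirrors the correction made in the cells above.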


Magic!

OK, let's go back to our original salutation example...

In [67]:
builder = dflow3.builders.derive_column_by_example(source_columns=['Name'], new_column_name='Salutation')
builder.add_example(source_data=df_source.iloc[0], example_value='Mr')
builder.preview()

Unnamed: 0,Name,Salutation
0,Mr. Seth Juarez,Mr
1,Ms Katherine Harding,Ms
2,Mrs Kami LeMonds,Mrs
3,Mr. Andrew Cencini,Mr
4,Mr. Darren Gehring,Mr
5,Ms. Rebecca Laszlo,Ms
6,Mr. Daniel P. Thompson,Mr
7,Mr. Paulo H. Lisboa,Mr
8,Ms. Aidan Delaney,Ms
9,Ms. Hattie J. Haemon,Ms


Now that we are happy with our forked experiment, we apply it back to the dataflow.

In [68]:
dflow4 = builder.to_dataflow()
dflow4.head(10)

Unnamed: 0,Name,Salutation,CompanyName,SalesPerson,EmailAddress,Founded,Last Order,Sales to Date,City,Clean City,City Label,postal_code,latitude,longitude
0,Mr. Seth Juarez,Mr,A Bike Store,adventure-works\pamela0,orlando0@adventure-works.com,21-Feb-73,531.0,95962473.0,San Francisco,San Francisco,1,94122.0,37.758941,-222.48591
1,Ms Katherine Harding,Ms,Vintage Sport Boutique,adventure-works\david8,kendra0@adventure-works.com,5 November 1880,68343.0,342244200.0,San Francisco,San Francisco,1,94122.0,37.758941,-122.48591
2,Mrs Kami LeMonds,Mrs,Trendy Department Stores,adventure-works\shu0,donald1@adventure-works.com,8-Oct-43,83287.0,92839201.0,San Francisco,San Francisco,1,94122.0,37.758941,-122.48591
3,Mr. Andrew Cencini,Mr,Sports Merchandise,adventure-works\pamela0,andrew2@adventure-works.com,1915,58533.0,43569020.0,SJ,San Jose,2,94115.0,37.782632,-122.432504
4,Mr. Darren Gehring,Mr,Journey Sporting Goods,adventure-works\jillian0,darren0@adventure-works.com,28-Jul-32,65744.0,38783980.0,San Antonio,San Antonio,0,94133.0,47.609722,-122.333056
5,Ms. Rebecca Laszlo,Ms,Instruments and Parts Company,adventure-works\jae0,rebecca2@adventure-works.com,8-Oct-43,18491.0,82452139.0,S.A.,San Antonio,0,94133.0,37.797817,-122.408597
6,Mr. Daniel P. Thompson,Mr,Travel Sports,adventure-works\pamela0,daniel2@adventure-works.com,10-Oct-50,49186.0,40685181.0,San Francisco,San Francisco,1,94122.0,37.758941,-122.48591
7,Mr. Paulo H. Lisboa,Mr,Elite Bikes,adventure-works\jillian0,paulo0@adventure-works.com,11 August 1897,87506.0,284357790.0,SAN JOSE,San Jose,2,94103.0,37.771437,-122.423892
8,Ms. Aidan Delaney,Ms,Paint Supply,adventure-works\jillian0,aidan0@adventure-works.com,10-May-35,24681.0,147720300.0,SA,San Antonio,0,94133.0,37.797398,-122.405322
9,Ms. Hattie J. Haemon,Ms,Greater Bike Store,adventure-works\jose1,hattie0@adventure-works.com,1928,79859.0,99428898.0,S.A.,San Antonio,0,94133.0,37.797468,-122.406147


Anywhere one set of operations, or one version of a DataFrame, makes assumptions about the output of another, there is a form of data contract. Downstream operations do not guarantee what they will do if the contract is broken, so we should be defensive and check that it is being met. One way to do this is with asserts. Here we assert that `Last Order` falls within a valid range.

In [49]:
dflow5 = dflow4.assert_value(
    columns='Last Order',
    expression=dprep.f_and(value > 1000, value < 100000),
    error_code='InvalidRange'
)
dflow5.head(10)

Unnamed: 0,Name,Salutation,CompanyName,SalesPerson,EmailAddress,Founded,Last Order,Sales to Date,City,postal_code,latitude,longitude
0,Mr. Seth Juarez,Mr,A Bike Store,adventure-works\pamela0,orlando0@adventure-works.com,21-Feb-73,"azureml.dataprep.native.DataPrepError(""'Invali...",95962473.0,San Francisco,94122.0,37.758941,-222.48591
1,Ms Katherine Harding,Ms,Vintage Sport Boutique,adventure-works\david8,kendra0@adventure-works.com,5 November 1880,68343,342244200.0,San Francisco,94122.0,37.758941,-122.48591
2,Mrs Kami LeMonds,Mrs,Trendy Department Stores,adventure-works\shu0,donald1@adventure-works.com,8-Oct-43,83287,92839201.0,San Francisco,94122.0,37.758941,-122.48591
3,Mr. Andrew Cencini,Mr,Sports Merchandise,adventure-works\pamela0,andrew2@adventure-works.com,1915,58533,43569020.0,SJ,94115.0,37.782632,-122.432504
4,Mr. Darren Gehring,Mr,Journey Sporting Goods,adventure-works\jillian0,darren0@adventure-works.com,28-Jul-32,65744,38783980.0,San Antonio,94133.0,47.609722,-122.333056
5,Ms. Rebecca Laszlo,Ms,Instruments and Parts Company,adventure-works\jae0,rebecca2@adventure-works.com,8-Oct-43,18491,82452139.0,S.A.,94133.0,37.797817,-122.408597
6,Mr. Daniel P. Thompson,Mr,Travel Sports,adventure-works\pamela0,daniel2@adventure-works.com,10-Oct-50,49186,40685181.0,San Francisco,94122.0,37.758941,-122.48591
7,Mr. Paulo H. Lisboa,Mr,Elite Bikes,adventure-works\jillian0,paulo0@adventure-works.com,11 August 1897,87506,284357790.0,SAN JOSE,94103.0,37.771437,-122.423892
8,Ms. Aidan Delaney,Ms,Paint Supply,adventure-works\jillian0,aidan0@adventure-works.com,10-May-35,24681,147720300.0,SA,94133.0,37.797398,-122.405322
9,Ms. Hattie J. Haemon,Ms,Greater Bike Store,adventure-works\jose1,hattie0@adventure-works.com,1928,79859,99428898.0,S.A.,94133.0,37.797468,-122.406147
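
The behaviour shown above, where a failing value is replaced by an error object instead of stopping the run, can be sketched in plain Python. `CellError` and this `assert_value` function are hypothetical stand-ins written for the illustration, not the SDK's own types.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CellError:
    """Hypothetical stand-in for an error value stored in a cell."""
    error_code: str
    original_value: object


def assert_value(values, predicate, error_code):
    """Replace each value that fails `predicate` with a CellError."""
    return [v if predicate(v) else CellError(error_code, v) for v in values]


# `Last Order` values from the five-row preview.
last_order = [531.0, 68343.0, 83287.0, 58533.0, 65744.0]
checked = assert_value(last_order, lambda v: 1000 < v < 100000, "InvalidRange")

# Fork the rows: clean values continue, failures go to exception handling.
good = [v for v in checked if not isinstance(v, CellError)]
bad = [v for v in checked if isinstance(v, CellError)]
print(good)
print(bad)
```

Keeping the original value inside the error object is a design choice of this sketch, so that the exception flow can still see what was rejected.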


Looking at the first row, we can see an error object with the code `InvalidRange`. It is possible to stop processing at this point. In this case we simply add an error value that we can use to fork this row, and others like it, into a different flow for exception processing.

All of these operations on the flow form a sort of graph. Because of the way the Data Prep engine works and generates code, the graph can be executed without the code in this notebook. So here we serialise it. We can re-open it later using the API, or manage, deploy and execute it using command line tools.

In [50]:
dflow5 = dflow5.set_name(name="DeepData")
package_path = "DeepData.dprep"

package = dprep.Package(arg=dflow5)
package.save(file_path=package_path)

Package
  name: None
  path: C:\dev\events\VSLIVE0319\Dev\VSLIVE_0319\DeepData.dprep
  dataflows: [
    Dataflow {
      name: DeepData
      steps: 6
    },
  ]
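
The idea of saving a flow as data and running it later without the code that built it can be sketched with JSON and a small registry of operations. This is not the `.dprep` file format. The step names, the `OPERATIONS` registry and the `run` function are all invented for the illustration.

```python
import json

# Hypothetical registry of step implementations, keyed by step name.
OPERATIONS = {
    "strip_dots": lambda rows, column: [
        {**r, column: r[column].replace(".", "")} for r in rows
    ],
    "upper": lambda rows, column: [
        {**r, column: r[column].upper()} for r in rows
    ],
    "keep": lambda rows, columns: [
        {c: r[c] for c in columns} for r in rows
    ],
}


def run(steps, rows):
    """Apply each recorded step in order."""
    for step in steps:
        rows = OPERATIONS[step["op"]](rows, **step["args"])
    return rows


steps = [
    {"op": "strip_dots", "args": {"column": "City"}},
    {"op": "upper", "args": {"column": "City"}},
    {"op": "keep", "args": {"columns": ["Name", "City"]}},
]

saved = json.dumps({"name": "DeepData", "steps": steps})  # would go to disk
loaded = json.loads(saved)                                # later, or elsewhere

rows = [{"Name": "Ms. Rebecca Laszlo", "City": "S.A.", "postal_code": 94133}]
result = run(loaded["steps"], rows)
print(result)
```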

The Data Prep API is not designed to replace Pandas, which has years of development by hundreds of people behind it. It is designed to augment Pandas and make some things easier or better. Pandas interop is therefore critical, so that you can fall back on methods and approaches that are familiar and well documented in the community when you need to. Our last step is to convert to a DataFrame. Notice that the error record has gone and is replaced by a NaN, because Pandas does not understand Data Prep error objects.

In [52]:
df = dflow5.to_pandas_dataframe()
df.head()

Unnamed: 0,Name,Salutation,CompanyName,SalesPerson,EmailAddress,Founded,Last Order,Sales to Date,City,postal_code,latitude,longitude
0,Mr. Seth Juarez,Mr,A Bike Store,adventure-works\pamela0,orlando0@adventure-works.com,21-Feb-73,,95962473.0,San Francisco,94122.0,37.758941,-222.48591
1,Ms Katherine Harding,Ms,Vintage Sport Boutique,adventure-works\david8,kendra0@adventure-works.com,5 November 1880,68343.0,342244200.0,San Francisco,94122.0,37.758941,-122.48591
2,Mrs Kami LeMonds,Mrs,Trendy Department Stores,adventure-works\shu0,donald1@adventure-works.com,8-Oct-43,83287.0,92839201.0,San Francisco,94122.0,37.758941,-122.48591
3,Mr. Andrew Cencini,Mr,Sports Merchandise,adventure-works\pamela0,andrew2@adventure-works.com,1915,58533.0,43569020.0,SJ,94115.0,37.782632,-122.432504
4,Mr. Darren Gehring,Mr,Journey Sporting Goods,adventure-works\jillian0,darren0@adventure-works.com,28-Jul-32,65744.0,38783980.0,San Antonio,94133.0,47.609722,-122.333056
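
The effect of the conversion on the error value can be mimicked with a short sketch: anything that is an error object becomes NaN and everything else becomes a float. `CellError` and `to_plain_floats` are hypothetical names, and this only imitates the outcome seen in the table above, not the conversion code itself.

```python
import math


class CellError:
    """Hypothetical stand-in for a Data Prep error value."""

    def __init__(self, error_code):
        self.error_code = error_code


def to_plain_floats(values):
    """Replace error objects with NaN and coerce the rest to float."""
    return [math.nan if isinstance(v, CellError) else float(v) for v in values]


column = [CellError("InvalidRange"), 68343, 83287, 58533, 65744]
plain = to_plain_floats(column)
print(plain)

# NaN is not equal to itself, so test for it explicitly.
missing = [i for i, v in enumerate(plain) if math.isnan(v)]
print(missing)
```

Once the value is a NaN the error code is no longer available, so any routing that depends on `InvalidRange` has to happen before the conversion.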
