Create using Woodwork with Dask guide #304

Merged: 6 commits, Oct 23, 2020
Changes from 3 commits
1 change: 1 addition & 0 deletions docs/source/guides/guides_index.rst
@@ -9,3 +9,4 @@ The guides below provide more detail on the functionality of Woodwork.
understanding_types_and_tags
setting_config_options
using_mi_and_describe
using_woodwork_with_dask
157 changes: 157 additions & 0 deletions docs/source/guides/using_woodwork_with_dask.ipynb
@@ -0,0 +1,157 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Using Woodwork with Dask Dataframes\n",
Contributor:
Dataframes -> DataFrames

Contributor Author:
Fixed

"\n",
"Woodwork enables DataTables to be created from Dask dataframes when working with datasets that are too large to easily fit in memory. Creating a DataTable from a Dask dataframe follows the same process as creating one from a pandas dataframe, but there are a few limitations to be aware of. This guide provides a brief overview of creating a DataTable from a Dask dataframe and outlines several key items to keep in mind when using a Dask dataframe as input.\n",
"\n",
"First, we will create a Dask dataframe to use in our example. Normally you would create this dataframe directly by reading in the data from saved files, but here we will create it from a demo pandas dataframe."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import dask.dataframe as dd\n",
"import woodwork as ww\n",
"\n",
"df_pandas = ww.demo.load_retail(nrows=1000, return_dataframe=True)\n",
"df_dask = dd.from_pandas(df_pandas, npartitions=10)\n",
"df_dask"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that we have a Dask dataframe, we can use this dataframe to create a Woodwork DataTable, just as we would with a pandas DataFrame:"
Contributor:
In this sentence, we have:

  • Dask dataframe
  • Woodwork DataTable
  • pandas DataFrame

which doesn't seem consistent from a capitalization standpoint. I think we usually capitalize Woodwork and Dask, but not pandas--is there a reason for that?

With DataFrame/DataTable, it's probably not as clear what we want to do. I think I've been trying to stick to pascal case when actually talking about the object and not capitalizing when referring to it as a word, but I'm not sure.

Contributor Author:
I've typically adopted the conventions from the respective libraries. If you look at the pandas documentation, they do not capitalize pandas, whereas Dask does in their documentation.

Regarding DataFrame/DataTable/dataframe - I will review my usage of this as I haven't been too consistent, but my thoughts are similar to yours. If we are referring to a specific object or implementation - like Dask DataFrame, we should use pascal case, but if we are referring to a more general representation that is not tied to a specific implementation, we should follow normal capitalization rules.

Contributor:
Makes a lot of sense to follow each library's conventions!

Contributor Author:
Went through everything and tried to clean up the capitalization. Hopefully I caught everything.

]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dt = ww.DataTable(df_dask, index='order_product_id')\n",
"dt.types"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As you can see from the output above, the DataTable was created successfully, and logical type inference was performed for all of the columns. However, this brings us to one of the key issues in working with Dask dataframes.\n",
"\n",
"In order to perform logical type inference, Woodwork needs to bring the data from the dataframe into memory so it can be analyzed. Currently, Woodwork reads data from only the first partition and then uses that data for type inference. Depending on the complexity of the data, this can be a time-consuming operation. Additionally, if the first partition is not representative of the entire dataset, the logical types for some columns may be inferred incorrectly.\n",
"\n",
"If this process takes too much time, or if the logical types are not inferred correctly, users can manually specify the logical types for each column. If the logical type for a column is specified, type inference for that column will be skipped. If logical types are specified for all columns, logical type inference will be skipped completely, and Woodwork will not need to bring any of the data into memory when creating the DataTable.\n",
"\n",
"To skip logical type inference completely, or to correct type inference issues, define a logical types dictionary with the correct logical type for each column in the dataframe, then pass that dictionary when creating the DataTable as shown below:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"logical_types = {\n",
" 'order_product_id': 'WholeNumber',\n",
" 'order_id': 'Categorical',\n",
" 'product_id': 'Categorical',\n",
" 'description': 'NaturalLanguage',\n",
" 'quantity': 'WholeNumber',\n",
" 'order_date': 'Datetime',\n",
" 'unit_price': 'Double',\n",
" 'customer_name': 'FullName',\n",
" 'country': 'Categorical',\n",
" 'total': 'Double',\n",
" 'cancelled': 'Boolean',\n",
"}\n",
"\n",
"dt = ww.DataTable(df_dask, index='order_product_id', logical_types=logical_types)\n",
"dt.types"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"There are two DataTable methods that also require bringing the underlying dataframe into memory: `describe` and `get_mutual_information`. When called, both of these methods will call a `compute` operation on the Dask dataframe associated with the DataTable in order to compute the desired information. This may be problematic for datasets that cannot fit in memory, so exercise caution when using these methods."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dt.describe(include=['numeric'])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dt.get_mutual_information().head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Data Validation Limitations\n",
"\n",
"When creating a DataTable, several validation checks are performed to confirm that the data in the dataframe is appropriate for the specified parameters. Because some of these validation steps require pulling the underlying data into memory, they are skipped when creating a DataTable from a Dask dataframe. This section provides an overview of the validation checks that are performed with pandas input but skipped with Dask input.\n",
"\n",
"### Index Uniqueness\n",
"Normally, a check is performed to verify that any column specified as the index contains no duplicate values. This check is skipped for Dask input, so users must manually verify that any column specified as an index column contains unique values.\n",
"\n",
"### Data Consistency with LogicalType\n",
"If users manually define the LogicalType for a column when creating the DataTable, a check is performed to verify that the data in that column is appropriate for the specified LogicalType. For example, with pandas input, if the user specifies a LogicalType of `Double` for a column that contains letters such as `['a', 'b', 'c']`, an error is raised because it is not possible to convert the letters into numeric values with the `float` dtype associated with the `Double` LogicalType.\n",
"\n",
"With Dask input, no such error is raised at the time of DataTable creation. Behind the scenes, Woodwork will still attempt to convert the column's physical type to `float`, and this conversion is added to the Dask task graph without raising an error. However, an error will be raised once a `compute` operation is called on the underlying dataframe and Dask attempts to execute the conversion step. Take extra care when using Dask input to make sure any specified logical types are consistent with the data in the columns, in order to avoid this type of error.\n",
"\n",
"Similarly, for the `Ordinal` LogicalType, a check is typically performed to make sure that the data column does not contain any values that are not present in the defined order values. This check will not be performed with Dask input. Users should manually verify that the defined order values are complete to avoid unexpected results."
]
},
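{
"cell_type": "markdown",
"metadata": {},
"source": [
"The checks described above can be run manually if desired. The following cell is a rough sketch of what that might look like; it is not functionality provided by Woodwork, and it reuses the `order_product_id` index and `unit_price` column from the example above:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Rough sketch of manual checks for Dask input (assumed approach, not a Woodwork API).\n",
"\n",
"# 1. Verify the index column contains no duplicate values.\n",
"assert df_dask['order_product_id'].nunique().compute() == len(df_dask)\n",
"\n",
"# 2. Spot-check one partition to confirm a column can be converted to the physical\n",
"#    type implied by its logical type (here, `Double` -> float64).\n",
"df_dask.get_partition(0).compute()['unit_price'].astype('float64')"
]
},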
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Other Limitations\n",
"\n",
"Woodwork provides the ability to read data directly from a CSV file into a DataTable, and during this process Woodwork creates the underlying dataframe so the user does not have to do so. The helper function used for this, `woodwork.read_csv`, will currently only read the data into a pandas dataframe. At some point, we hope to remove this limitation and also allow data to be read into a Dask DataFrame, but for now only pandas dataframes can be created by this function."
]
},
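{
"cell_type": "markdown",
"metadata": {},
"source": [
"One possible workaround, sketched below with a placeholder file path, is to read the CSV with `dask.dataframe.read_csv` and then create the DataTable from the resulting Dask dataframe:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hypothetical workaround sketch: read the CSV with Dask directly, then build the\n",
"# DataTable manually. 'retail.csv' is a placeholder path, so the calls are commented out.\n",
"# df = dd.read_csv('retail.csv')\n",
"# dt = ww.DataTable(df, index='order_product_id', logical_types=logical_types)"
]
}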
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.4"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
1 change: 1 addition & 0 deletions docs/source/release_notes.rst
@@ -13,6 +13,7 @@ Release Notes
* Update ``DataTable.describe`` to work with Dask input (:pr:`296`)
* Update ``DataTable.get_mutual_information`` to work with Dask input (:pr:`300`)
* Documentation Changes
* Create a guide for using Woodwork with Dask (:pr:`304`)
* Testing Changes
* Parameterize numeric time index tests (:pr:`288`)
