Add serialization guide to docs #1066

Merged
merged 27 commits into from Jul 29, 2021
Commits
af04be2
add notebook
jeff-hernandez Jul 22, 2021
e20f2ec
Merge branch 'main' into serialization_guide
jeff-hernandez Jul 22, 2021
0a759b8
add to guides index
jeff-hernandez Jul 22, 2021
5ddc450
avoid json highlighting
jeff-hernandez Jul 22, 2021
9033610
Merge branch 'main' into serialization_guide
jeff-hernandez Jul 27, 2021
ef7c805
add section for read_file
jeff-hernandez Jul 27, 2021
8b1f383
Merge branch 'main' into serialization_guide
jeff-hernandez Jul 27, 2021
3a95194
update release notes
jeff-hernandez Jul 27, 2021
e02f938
add example with different formats
jeff-hernandez Jul 27, 2021
451bb1c
remove links
jeff-hernandez Jul 27, 2021
ad4cde9
remove link
jeff-hernandez Jul 27, 2021
866d0c3
cleanup retail directory
jeff-hernandez Jul 27, 2021
249db90
adjust header
jeff-hernandez Jul 27, 2021
8b68909
add metadata to hide cell
jeff-hernandez Jul 27, 2021
2909b30
interweave example
jeff-hernandez Jul 28, 2021
b17873c
Merge branch 'main' into serialization_guide
jeff-hernandez Jul 28, 2021
24ce97f
Update docs/source/guides/saving_and_loading_dataframes.ipynb
jeff-hernandez Jul 28, 2021
a81b9c7
Update docs/source/guides/saving_and_loading_dataframes.ipynb
jeff-hernandez Jul 28, 2021
0989d9d
Update docs/source/guides/saving_and_loading_dataframes.ipynb
jeff-hernandez Jul 28, 2021
a508d95
revert json typing info
jeff-hernandez Jul 28, 2021
29d61d5
Merge branch 'main' into serialization_guide
jeff-hernandez Jul 28, 2021
23fea6d
add example without typing info
jeff-hernandez Jul 28, 2021
c00e12c
fix release notes
jeff-hernandez Jul 29, 2021
195f4dc
Revert "fix release notes"
jeff-hernandez Jul 29, 2021
48f4d33
Revert "update release notes"
jeff-hernandez Jul 29, 2021
613976c
Merge branch 'main' into serialization_guide
jeff-hernandez Jul 29, 2021
d62dd16
update release notes
jeff-hernandez Jul 29, 2021
1 change: 1 addition & 0 deletions docs/source/guides/guides_index.rst
@@ -11,3 +11,4 @@ The guides below provide more detail on the functionality of Woodwork.
statistical_insights
using_woodwork_with_dask_and_koalas
custom_types_and_type_inference
saving_and_loading_dataframes
264 changes: 264 additions & 0 deletions docs/source/guides/saving_and_loading_dataframes.ipynb
@@ -0,0 +1,264 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "federal-queensland",
"metadata": {},
"source": [
"# Saving and Loading DataFrames\n",
"\n",
"In this guide, you will learn how to save and load Woodwork DataFrames.\n",
"\n",
"## Saving a Woodwork DataFrame\n",
"\n",
"After defining a Woodwork DataFrame with the proper logical types and semantic tags, you can save the DataFrame and typing information by using `DataFrame.ww.to_disk`. This method will create a directory that contains a `data` folder and a `woodwork_typing_info.json` file. To illustrate, we will use this retail DataFrame which already comes configured with Woodwork typing information."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "tender-inventory",
"metadata": {
"scrolled": false
},
"outputs": [],
"source": [
"from woodwork.demo import load_retail\n",
"df = load_retail(nrows=100)\n",
"df.ww.schema"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "wooden-danish",
"metadata": {},
"outputs": [],
"source": [
"df.head()"
]
},
{
"cell_type": "markdown",
"id": "designed-soccer",
"metadata": {},
"source": [
"From the `ww` acessor, use `to_disk` to save the Woodwork DataFrame."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "powered-protein",
"metadata": {},
"outputs": [],
"source": [
"df.ww.to_disk('retail')"
]
},
{
"cell_type": "markdown",
"id": "interstate-tactics",
"metadata": {},
"source": [
"You should see a new directory that contains the data and typing information.\n",
"\n",
"```\n",
"retail\n",
"├── data\n",
"│ └── demo_retail_data.csv\n",
"└── woodwork_typing_info.json\n",
"```\n",
"\n",
"### Data Directory\n",
"\n",
"The `data` directory contains the underlying data written in the specified format. The method derives the filename from `DataFrame.ww.name` and uses CSV as the default format. You can change the format by setting the method's `format` parameter to any of the following formats:\n",
"\n",
"- csv (default)\n",
"- pickle\n",
"- parquet\n",
"\n",
"### Typing Information\n",
"\n",
"In the `woodwork_typing_info.json`, you can see all of the typing information and metadata associated with the DataFrame. This information includes:\n",
"\n",
"- the version of the schema at the time of saving the DataFrame\n",
"- the DataFrame name specified by `DataFrame.ww.name`\n",
"- the column names for the index and time index\n",
"- the column typing information, which contains the logical types with their parameters and semantic tags for each column\n",
"- the loading information required for the DataFrame type and file format\n",
"- the table metadata provided by `DataFrame.ww.metadata` (must be JSON serializable)\n",
"\n",
"```text\n",
"{\n",
" \"schema_version\": \"10.0.2\",\n",
" \"name\": \"demo_retail_data\",\n",
" \"index\": \"order_product_id\",\n",
" \"time_index\": \"order_date\",\n",
" \"column_typing_info\": [...],\n",
" \"loading_info\": {\n",
" \"table_type\": \"pandas\",\n",
" \"location\": \"data/demo_retail_data.csv\",\n",
" \"type\": \"csv\",\n",
" \"params\": {\n",
" \"compression\": null,\n",
" \"sep\": \",\",\n",
" \"encoding\": \"utf-8\",\n",
" \"engine\": \"python\",\n",
" \"index\": false\n",
" }\n",
" },\n",
" \"table_metadata\": {}\n",
"}\n",
"```\n",
"\n",
"## Loading a Woodwork DataFrame\n",
"\n",
"After saving a Woodwork DataFrame, you can load the DataFrame and typing information by using `woodwork.deserialize.read_woodwork_table`. This function will use the stored typing information in the specified directory to recreate the Woodwork DataFrame."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "oriental-baking",
"metadata": {},
"outputs": [],
"source": [
"from woodwork.deserialize import read_woodwork_table\n",
"df = read_woodwork_table('retail')\n",
"df.ww.schema"
]
},
{
"cell_type": "markdown",
"id": "dependent-liberty",
"metadata": {},
"source": [
"### Loading the DataFrame and typing information separately\n",
"\n",
"You can also load the Woodwork DataFrame and typing information separately by using `woodwork.read_file`. This approach is helpful if you want to save and load the typing information outside the specified directory or read a data file directly into a Woodwork DataFrame. To illustrate, we will load the typing information first before reading data files in different formats directly into a Woodwork DataFrame. "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "senior-richards",
"metadata": {},
"outputs": [],
"source": [
"from json import load\n",
"\n",
"with open('retail/woodwork_typing_info.json') as file:\n",
" typing_information = load(file)"
]
},
{
"cell_type": "markdown",
"id": "mighty-bargain",
"metadata": {},
"source": [
"Let's create the data files in different formats from a pandas DataFrame."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "suspected-transcription",
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"pandas_df = pd.read_csv('retail/data/demo_retail_data.csv')\n",
"pandas_df.to_csv('retail/data.csv')\n",
"pandas_df.to_parquet('retail/data.parquet')\n",
"pandas_df.to_feather('retail/data.feather')"
]
},
{
"cell_type": "markdown",
"id": "patent-comfort",
"metadata": {},
"source": [
"Now, you can use `read_file` to load the data directly into a Woodwork DataFrame. This function uses the `content_type` parameter to determine the file format. If `content_type` is not specified, the function will try to infer the file format from the file extension."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "induced-newton",
"metadata": {},
"outputs": [],
"source": [
"from woodwork import read_file\n",
"\n",
"woodwork_df = read_file(\n",
" filepath='retail/data.csv',\n",
" content_type='csv',\n",
" index=typing_information['index'],\n",
" time_index=typing_information['time_index'],\n",
")\n",
"\n",
"woodwork_df = read_file(\n",
" filepath='retail/data.parquet',\n",
" content_type='parquet',\n",
" index=typing_information['index'],\n",
" time_index=typing_information['time_index'],\n",
")\n",
"\n",
"woodwork_df = read_file(\n",
" filepath='retail/data.feather',\n",
" content_type='feather',\n",
" index=typing_information['index'],\n",
" time_index=typing_information['time_index'],\n",
")\n",
"\n",
"woodwork_df.ww"
]
},
{
"cell_type": "markdown",
"id": "detailed-gather",
"metadata": {},
"source": [
"The parameters related to typing information such as the index, time index, logical types, and semantics tags are optional. So, you can read data files into Woodwork DataFrames and let Woodwork inference the typing information automatically."
]
},
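{
"cell_type": "markdown",
"id": "inference-sketch-note",
"metadata": {},
"source": [
"As a rough sketch, the CSV file written above can be read back without `content_type` or any typing parameters, letting Woodwork infer the file format from the extension and the column types from the data (the `inferred_df` name is just for illustration):"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "inference-sketch",
"metadata": {},
"outputs": [],
"source": [
"# Illustrative sketch: no content_type and no typing information supplied,\n",
"# so the format is inferred from the .csv extension and the logical types\n",
"# are inferred from the data itself.\n",
"inferred_df = read_file(filepath='retail/data.csv')\n",
"inferred_df.ww.schema"
]
},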
{
"cell_type": "code",
"execution_count": null,
"id": "freelance-charlotte",
"metadata": {
"nbsphinx": "hidden"
},
"outputs": [],
"source": [
"# cleanup retail directory\n",
"from shutil import rmtree\n",
"rmtree('retail')"
]
}
],
"metadata": {
"celltoolbar": "Edit Metadata",
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.8"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
3 changes: 2 additions & 1 deletion docs/source/release_notes.rst
@@ -9,11 +9,12 @@ Future Release
* Fixes
* Changes
* Documentation Changes
* Add guide for saving and loading Woodwork DataFrames (:pr:`1066`)
* Testing Changes
* Add additional reviewers to minimum and latest dependency checkers (:pr:`1070`, :pr:`1073`, :pr:`1077`)

Thanks to the following people for contributing to this release:
:user:`gsheni`
:user:`gsheni`, :user:`jeff-hernandez`

v0.5.1 Jul 22, 2021
===================