Make dask an optional dependency #357

Merged
14 commits merged on Nov 9, 2020
38 changes: 30 additions & 8 deletions .circleci/config.yml
@@ -32,6 +32,8 @@ jobs:
parameters:
image_tag:
type: string
optional_libraries:
type: string
executor:
name: python
image_tag: << parameters.image_tag >>
@@ -45,8 +47,20 @@
source venv/bin/activate
python -m pip config --site set global.progress_bar off
python -m pip install --upgrade pip
python -m pip install -e unpacked_sdist/
python -m pip install -r unpacked_sdist/test-requirements.txt
- when:
condition:
equal: [ "optional", << parameters.optional_libraries >> ]
steps:
- run: |
python -m pip install -e unpacked_sdist/[dask]
python -m pip install -r unpacked_sdist/test-requirements.txt
- unless:
condition:
equal: ["optional", << parameters.optional_libraries >> ]
steps:
- run: |
python -m pip install -e unpacked_sdist/
python -m pip install -r unpacked_sdist/test-requirements.txt
- persist_to_workspace:
root: ~/woodwork
paths:
@@ -58,6 +72,8 @@ jobs:
parameters:
image_tag:
type: string
optional_libraries:
type: string
executor:
name: python
image_tag: << parameters.image_tag >>
@@ -67,7 +83,9 @@
at: ~/woodwork
- when:
condition:
equal: [ "3.6", << parameters.image_tag >> ]
and:
- equal: [ "3.6", << parameters.image_tag >> ]
- equal: ["optional", << parameters.optional_libraries >>]
steps:
- run: |
source venv/bin/activate
@@ -81,7 +99,9 @@
codecov --required
- unless:
condition:
equal: [ "3.6", << parameters.image_tag >> ]
and:
- equal: [ "3.6", << parameters.image_tag >> ]
- equal: ["optional", << parameters.optional_libraries >>]
steps:
- run: |
source venv/bin/activate
@@ -143,23 +163,25 @@ workflows:
matrix:
parameters:
image_tag: ["3.8", "3.7", "3.6"]
name: << matrix.image_tag >> install woodwork
optional_libraries: ["minimal", "optional"]
name: << matrix.image_tag >> install woodwork << matrix.optional_libraries >>
context: dockerhub
- unit_tests:
matrix:
parameters:
image_tag: ["3.8", "3.7", "3.6"]
name: << matrix.image_tag >> unit tests
optional_libraries: ["minimal", "optional"]
name: << matrix.image_tag >> unit tests << matrix.optional_libraries >>
requires:
- << matrix.image_tag >> install woodwork
- << matrix.image_tag >> install woodwork << matrix.optional_libraries >>
context: dockerhub
- lint_test:
matrix:
parameters:
image_tag: ["3.8", "3.7", "3.6"]
name: << matrix.image_tag >> lint test
requires:
- << matrix.image_tag >> install woodwork
- << matrix.image_tag >> install woodwork optional
context: dockerhub
- release_notes_updated:
name: "release notes updated"
1 change: 1 addition & 0 deletions dask-requirements.txt
@@ -0,0 +1 @@
dask[dataframe]>=2.30.0
3 changes: 2 additions & 1 deletion dev-requirements.txt
@@ -12,4 +12,5 @@ ipython==7.18.1; python_version>'3.6'
pygments==2.7.0
jupyter==1.0.0
pandoc==1.0.2
ipykernel==5.3.4
ipykernel==5.3.4
-r dask-requirements.txt
7 changes: 6 additions & 1 deletion docs/source/guides/using_woodwork_with_dask.ipynb
@@ -8,6 +8,11 @@
"\n",
"Woodwork enables DataTables to be created from Dask DataFrames when working with datasets that are too large to easily fit in memory. Although creating a DataTable from a Dask DataFrame follows the same process as one would follow when creating a DataTable from a pandas DataFrame, there are a few limitations to be aware of. This guide will provide a brief overview of creating a DataTable starting with a Dask DataFrame, and will outline several key items to keep in mind when using a Dask DataFrame as input.\n",
"\n",
"Dask DataTables require the installation of the Dask library, which can be installed directly with the following command:\n",
"```bash\n",
"pip install \"woodwork[dask]\"\n",
"```\n",
"\n",
"First we will create a Dask DataFrame to use in our example. Normally you would create the DataFrame directly by reading in the data from saved files, but we will create it from a demo pandas DataFrame."
]
},
@@ -162,7 +167,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.3"
"version": "3.6.8"
}
},
"nbformat": 4,
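A minimal sketch of the demo setup the guide cell above describes; the column names, values, and `npartitions=2` are illustrative assumptions, not the notebook's actual cells:

```python
import pandas as pd
import dask.dataframe as dd
import woodwork as ww

# Hypothetical demo data; the real notebook builds its own demo frame.
df_pandas = pd.DataFrame({
    'id': [0, 1, 2, 3],
    'amount': [10.5, 20.0, 7.25, 3.0],
})

# Converting to Dask and creating the DataTable follows the same API as pandas.
df_dask = dd.from_pandas(df_pandas, npartitions=2)
dt = ww.DataTable(df_dask, index='id')
```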
17 changes: 16 additions & 1 deletion docs/source/install.ipynb
@@ -16,6 +16,18 @@
"python -m pip install woodwork\n",
"```\n",
"\n",
"Woodwork allows users to install add-ons individually or all at once. In order to install all add-ons, run:\n",
"```bash\n",
"python -m pip install \"woodwork[complete]\"\n",
"```\n",
"\n",
"You can use Woodwork to create Dask DataTables by running:\n",
"\n",
"```bash\n",
"python -m pip install \"woodwork[dask]\"\n",
"```\n",
"\n",
"\n",
"\n",
"## Conda\n",
"\n",
@@ -25,6 +37,9 @@
"conda install -c conda-forge woodwork\n",
"```\n",
"\n",
".. note::\n",
" In order to create Dask DataTables, the following command must be run prior to installing Woodwork with conda: `conda install dask`\n",
"\n",
"\n",
"\n",
"## Source\n",
@@ -61,7 +76,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.5"
"version": "3.6.8"
}
},
"nbformat": 4,
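As a quick sanity check after installing the optional extra (an illustrative snippet, not part of the install guide), you can confirm that Dask is importable:

```python
import importlib.util

# True when Dask is importable, e.g. after `pip install "woodwork[dask]"`.
print(importlib.util.find_spec('dask') is not None)
```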
1 change: 1 addition & 0 deletions docs/source/release_notes.rst
@@ -25,6 +25,7 @@ Release Notes
* Remove ``copy`` parameter from ``DataTable.to_dataframe`` and ``DataColumn.to_series`` (:pr:`338`)
* Allow pandas ExtensionArrays as inputs to DataColumn (:pr:`343`)
* Move warnings to a separate exceptions file and call via UserWarning subclasses (:pr:`348`)
* Make Dask an optional dependency installable with ``woodwork[dask]`` (:pr:`357`)
* Documentation Changes
* Create a guide for using Woodwork with Dask (:pr:`304`)
* Add conda install instructions (:pr:`305`, :pr:`309`)
1 change: 0 additions & 1 deletion requirements.txt
@@ -1,6 +1,5 @@
numpy>=1.19.1
pandas>=1.1.0
click>=7.1.2
dask[dataframe]>=2.30.0
scikit-learn>=0.21.3
pyarrow>=2.0.0
3 changes: 3 additions & 0 deletions setup.py
@@ -6,6 +6,8 @@
with open(path.join(dirname, 'README.md')) as f:
long_description = f.read()

extras_require = {'dask': open('dask-requirements.txt').readlines()}
extras_require['complete'] = sorted(set(sum(extras_require.values(), [])))

setup(
name='woodwork',
@@ -27,6 +29,7 @@
install_requires=open('requirements.txt').readlines(),
tests_require=open('test-requirements.txt').readlines(),
python_requires='>=3.6, <4',
extras_require=extras_require,
keywords='data science machine learning typing',
include_package_data=True,
entry_points={
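The `extras_require` construction above flattens every per-extra requirements list into a combined `complete` extra. A small stand-alone illustration of the same idiom, with the single line from `dask-requirements.txt` inlined for the example:

```python
# Same idiom as setup.py, with the requirements file contents inlined.
extras_require = {'dask': ['dask[dataframe]>=2.30.0']}

# sum(lists, []) concatenates the per-extra lists; set() and sorted() deduplicate and order them.
extras_require['complete'] = sorted(set(sum(extras_require.values(), [])))

print(extras_require)
# {'dask': ['dask[dataframe]>=2.30.0'], 'complete': ['dask[dataframe]>=2.30.0']}
```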
14 changes: 9 additions & 5 deletions woodwork/data_column.py
@@ -1,6 +1,5 @@
import warnings

import dask.dataframe as dd
import pandas as pd
import pandas.api.types as pdtypes

@@ -26,9 +25,12 @@
from woodwork.utils import (
_convert_input_to_set,
_get_ltype_class,
col_is_datetime
col_is_datetime,
import_or_none
)

dd = import_or_none('dask.dataframe')


class DataColumn(object):
def __init__(self, series,
@@ -94,7 +96,7 @@ def _update_dtype(self):
# Update the underlying series
try:
if _get_ltype_class(self.logical_type) == Datetime:
if isinstance(self._series, dd.Series):
if dd and isinstance(self._series, dd.Series):
name = self._series.name
self._series = dd.to_datetime(self._series, format=self.logical_type.datetime_format)
self._series.name = name
@@ -132,7 +134,9 @@ def set_logical_type(self, logical_type, retain_index_tags=True):
return new_col

def _set_series(self, series):
if not (isinstance(series, pd.Series) or isinstance(series, dd.Series) or isinstance(series, pd.api.extensions.ExtensionArray)):
if not (isinstance(series, pd.Series) or
(dd and isinstance(series, dd.Series)) or
isinstance(series, pd.api.extensions.ExtensionArray)):
raise TypeError('Series must be a pandas Series, Dask Series, or a pandas ExtensionArray')

# pandas ExtensionArrays should be converted to pandas.Series
@@ -323,7 +327,7 @@ def infer_logical_type(series):
Args:
series (pd.Series): Input Series
"""
if isinstance(series, dd.Series):
if dd and isinstance(series, dd.Series):
series = series.get_partition(0).compute()
natural_language_threshold = config.get_option('natural_language_threshold')

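`import_or_none` comes from `woodwork.utils`, and its body is not part of this diff. A plausible sketch of such a helper (an assumption about its shape, not the PR's actual implementation) that returns the module when available and `None` otherwise, which is what makes the `if dd and isinstance(...)` guards above safe:

```python
import importlib

def import_or_none(library):
    """Return the imported module, or None if the library is not installed (sketch)."""
    try:
        return importlib.import_module(library)
    except ImportError:
        return None

dd = import_or_none('dask.dataframe')  # None when Dask is not installed
```

Assigning the result to a module-level `dd` keeps the rest of the file's code paths unchanged while making the Dask import lazy and optional.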
17 changes: 10 additions & 7 deletions woodwork/data_table.py
@@ -1,6 +1,5 @@
import warnings

import dask.dataframe as dd
import pandas as pd
from sklearn.metrics.cluster import normalized_mutual_info_score

@@ -19,9 +18,12 @@
_get_ltype_class,
_get_mode,
_is_numeric_series,
col_is_datetime
col_is_datetime,
import_or_none
)

dd = import_or_none('dask.dataframe')


class DataTable(object):
def __init__(self, dataframe,
@@ -72,7 +74,7 @@ def __init__(self, dataframe,
self._dataframe = dataframe

if make_index:
if isinstance(self._dataframe, dd.DataFrame):
if dd and isinstance(self._dataframe, dd.DataFrame):
self._dataframe[index] = 1
self._dataframe[index] = self._dataframe[index].cumsum() - 1
else:
@@ -545,7 +547,7 @@ def describe(self, include=None):

results = {}

if isinstance(self._dataframe, dd.DataFrame):
if dd and isinstance(self._dataframe, dd.DataFrame):
df = self._dataframe.compute()
else:
df = self._dataframe
@@ -628,7 +630,7 @@ def value_counts(self, ascending=False, top_n=10, dropna=False):
val_counts = {}
valid_cols = [col for col, column in self.columns.items() if column._is_categorical()]
data = self._dataframe[valid_cols]
if isinstance(data, dd.DataFrame):
if dd and isinstance(data, dd.DataFrame):
data = data.compute()

for col in valid_cols:
@@ -725,7 +727,7 @@ def get_mutual_information(self, num_bins=10, nrows=None):
_get_ltype_class(column.logical_type) == Boolean)
)]
data = self._dataframe[valid_columns]
if isinstance(data, dd.DataFrame):
if dd and isinstance(data, dd.DataFrame):
data = data.compute()

# cut off data if necessary
@@ -814,7 +816,8 @@ def to_parquet(self, path, compression=None, profile_name=None):

def _validate_params(dataframe, name, index, time_index, logical_types, semantic_tags, make_index):
"""Check that values supplied during DataTable initialization are valid"""
if not isinstance(dataframe, (pd.DataFrame, dd.DataFrame)):
if not ((dd and isinstance(dataframe, dd.DataFrame)) or
isinstance(dataframe, pd.DataFrame)):
raise TypeError('Dataframe must be one of: pandas.DataFrame, dask.DataFrame')
_check_unique_column_names(dataframe)
if name and not isinstance(name, str):
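Every Dask-specific branch in this file (and in `data_column.py` and `serialize.py`) relies on the same short-circuit guard. A small self-contained sketch of why the guard is safe when Dask is absent; `FakeFrame` is only a stand-in object for illustration:

```python
dd = None  # what import_or_none returns when dask.dataframe cannot be imported

class FakeFrame:
    """Stand-in for an arbitrary object passed where a DataFrame is expected."""

obj = FakeFrame()

# Because dd is None (falsy), the isinstance check is never evaluated, so
# dd.DataFrame is never accessed and no AttributeError can be raised.
is_dask = dd and isinstance(obj, dd.DataFrame)
print(bool(is_dask))  # False
```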
11 changes: 8 additions & 3 deletions woodwork/deserialize.py
@@ -6,15 +6,14 @@
from itertools import zip_longest
from pathlib import Path

import dask.dataframe as dd
import pandas as pd

from woodwork import DataTable
from woodwork.exceptions import OutdatedSchemaWarning, UpgradeSchemaWarning
from woodwork.logical_types import str_to_logical_type
from woodwork.s3_utils import get_transport_params, use_smartopen
from woodwork.serialize import FORMATS, SCHEMA_VERSION
from woodwork.utils import _is_s3, _is_url
from woodwork.utils import _is_s3, _is_url, import_or_raise


def read_table_metadata(path):
@@ -60,7 +59,13 @@ def metadata_to_datatable(table_metadata, **kwargs):
table_type = loading_info.get('table_type', 'pandas')

if table_type == 'dask':
lib = dd
DASK_ERR_MSG = (
'Cannot load Dask DataTable - unable to import Dask.\n\n'
'Please install with pip or conda:\n\n'
'python -m pip install "woodwork[dask]"\n\n'
'conda install dask'
)
lib = import_or_raise('dask.dataframe', DASK_ERR_MSG)
else:
lib = pd

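`import_or_raise` is likewise imported from `woodwork.utils` without appearing in this diff. A plausible sketch (an assumption, not the PR's actual helper) that mirrors `import_or_none` but raises the supplied message when the import fails:

```python
import importlib

def import_or_raise(library, error_msg):
    """Return the imported module, or raise ImportError with a helpful message (sketch)."""
    try:
        return importlib.import_module(library)
    except ImportError:
        raise ImportError(error_msg)
```

With this shape, `metadata_to_datatable` only pays the import cost, and only surfaces the install hint, when a serialized Dask table is actually being loaded.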
13 changes: 7 additions & 6 deletions woodwork/serialize.py
@@ -4,16 +4,17 @@
import tarfile
import tempfile

import dask.dataframe as dd

from woodwork.s3_utils import get_transport_params, use_smartopen
from woodwork.utils import (
_get_ltype_class,
_get_specified_ltype_params,
_is_s3,
_is_url
_is_url,
import_or_none
)

dd = import_or_none('dask.dataframe')

SCHEMA_VERSION = '1.0.0'
FORMATS = ['csv', 'pickle', 'parquet']

@@ -40,7 +41,7 @@ def datatable_to_metadata(datatable):
for col in datatable.columns.values()
]

if isinstance(df, dd.DataFrame):
if dd and isinstance(df, dd.DataFrame):
table_type = 'dask'
else:
table_type = 'pandas'
@@ -114,7 +115,7 @@ def write_table_data(datatable, path, format='csv', **kwargs):
dt_name = datatable.name or 'data'
df = datatable.to_dataframe()

if isinstance(df, dd.DataFrame) and format == 'csv':
if dd and isinstance(df, dd.DataFrame) and format == 'csv':
basename = "{}-*.{}".format(dt_name, format)
else:
basename = '.'.join([dt_name, format])
@@ -131,7 +132,7 @@ def write_table_data(datatable, path, format='csv', **kwargs):
)
elif format == 'pickle':
# Dask currently does not support to_pickle
if isinstance(df, dd.DataFrame):
if dd and isinstance(df, dd.DataFrame):
msg = 'Cannot serialize Dask DataTable to pickle'
raise ValueError(msg)
df.to_pickle(file, **kwargs)