Update pandas and fix load_fraud() #486

Merged: 17 commits, Mar 13, 2020
1 change: 1 addition & 0 deletions docs/source/changelog.rst
@@ -8,6 +8,7 @@ Changelog
* Fixes
* Changes
* Undo version cap in XGBoost placed in :pr:`402` and allowed all releases of XGBoost :pr:`407`
* Support pandas 1.0.0 :pr:`486`
* Documentation Changes
* Updated API reference to remove PipelinePlot and added moved PipelineBase plotting methods :pr:`483`
* Testing Changes
5 changes: 2 additions & 3 deletions evalml/guardrails/utils.py
@@ -33,9 +33,8 @@ def detect_label_leakage(X, y, threshold=.95):
if len(X.columns) == 0:
return {}

corrs = X.corrwith(y).abs()
out = corrs[corrs >= threshold]
return out.to_dict()
corrs = {label: abs(y.corr(col)) for label, col in X.iteritems() if abs(y.corr(col)) >= threshold}
return corrs


def detect_highly_null(X, percent_threshold=.95):
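The neighboring `detect_highly_null` guardrail is collapsed in this diff; a plausible pandas-only body, sketched here purely as an assumption from the signature rather than the file's actual implementation, would be:

```python
import pandas as pd

def detect_highly_null(X, percent_threshold=.95):
    """Sketch: return columns whose fraction of null values meets the threshold."""
    null_frac = X.isnull().mean()
    return null_frac[null_frac >= percent_threshold].to_dict()

# 19 of 20 values null -> fraction 0.95, which meets the default threshold
X = pd.DataFrame({"mostly_null": [None] * 19 + [1.0],
                  "dense": range(20)})
print(detect_highly_null(X))  # {'mostly_null': 0.95}
```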
21 changes: 5 additions & 16 deletions evalml/preprocessing/utils.py
@@ -1,13 +1,12 @@
import pandas as pd
from dask import dataframe as dd
from sklearn.model_selection import ShuffleSplit, StratifiedShuffleSplit


def load_data(path, index, label, n_rows=None, drop=None, verbose=True, **kwargs):
"""Load features and labels from file(s).

Args:
path (str) : path to file(s)
path (str) : path to file or a http/ftp/s3 URL
index (str) : column for index
label (str) : column for labels
n_rows (int) : number of rows to return
@@ -17,22 +16,12 @@ def load_data(path, index, label, n_rows=None, drop=None, verbose=True, **kwargs
Returns:
pd.DataFrame, pd.Series : features and labels
"""
if '*' in path:
feature_matrix = dd.read_csv(path, **kwargs).set_index(index, sorted=True)

labels = [label] + (drop or [])
y = feature_matrix[label].compute()
X = feature_matrix.drop(labels=labels, axis=1).compute()
feature_matrix = pd.read_csv(path, index_col=index, nrows=n_rows, **kwargs)
Contributor: so, we used dask to support loading globs. i think it's fine to remove that, but we could update the doc string above for path

Contributor Author: Ah got it, I'll update the docstring!

Contributor: The problem is that we no longer support multiple files. It has to be a single file.

if n_rows:
X = X.head(n_rows)
y = y.head(n_rows)
else:
feature_matrix = pd.read_csv(path, index_col=index, nrows=n_rows, **kwargs)

labels = [label] + (drop or [])
y = feature_matrix[label]
X = feature_matrix.drop(columns=labels)
Contributor: @jeremyliweishih nice, you beat me to it RE #315!

Just for clarity, are these changes needed to support pandas 1.0.0? Or is this just something you wanted to do? I'm on board either way, although technically if it were the latter, it should be in a separate PR.

Contributor Author: If we support pandas 1.0.0 it introduces a FutureWarning when importing evalml, since dask hasn't properly silenced it yet. I think it should be in this PR!

Contributor: Ah ok, cool, then yeah I'm on board with deleting the dask dependency in requirements.txt too!
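The alternative to dropping dask would have been filtering the warning at import time. A minimal, self-contained sketch of that approach, using a stand-in function rather than dask itself (the warning text is illustrative):

```python
import warnings

def noisy_import():
    # Stand-in for importing dask, which emitted a FutureWarning under pandas 1.0.0
    warnings.warn("pandas.util.testing is deprecated", FutureWarning)

# Without a filter, the warning surfaces to every caller:
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    noisy_import()
print(len(caught))  # 1

# Ignoring FutureWarning for the duration of the import silences it:
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("ignore", FutureWarning)
    noisy_import()
print(len(caught))  # 0
```

Removing the dependency, as this PR does, avoids carrying such a filter at all.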

labels = [label] + (drop or [])
y = feature_matrix[label]
X = feature_matrix.drop(columns=labels)

if verbose:
# number of features
1 change: 0 additions & 1 deletion evalml/tests/guardrail_tests/test_detect_label_leakage.py
@@ -15,5 +15,4 @@ def test_detect_label_leakage():
y = y.astype(bool)

result = detect_label_leakage(X, y)

assert set(["a", "b", "c", "d"]) == set(result.keys())
4 changes: 1 addition & 3 deletions evalml/tests/latest_dependency_versions.txt
@@ -1,10 +1,8 @@
catboost==0.22
category-encoders==2.1.0
cloudpickle==1.3.0
dask==2.12.0
distributed==2.12.0
numpy==1.18.1
pandas==0.25.3
pandas==1.0.1
pyzmq==19.0.0
scikit-learn==0.22.2.post1
scikit-optimize==0.7.4
3 changes: 1 addition & 2 deletions requirements.txt
@@ -1,8 +1,7 @@
scipy>=1.2.1
scikit-learn>=0.21.3,!=0.22
dask[complete]>=2.1.0
Contributor: 🎉
numpy>=1.16.4
pandas>=0.25.0,<1.0.0
pandas>=0.25.0
xgboost>=0.82
tqdm>=4.33.0
scikit-optimize[plots]