Update pandas and fix load_fraud()
#486
Changes from all commits
c6c8c8c
7bc7ebc
34d9ada
8ba9e59
a8438bc
b4e1276
05d1f54
30d36d9
4fc65ce
0f7e013
7348dc3
1a0e1d4
e9d05e4
c3ba501
5817e52
4887b88
3f101d7
@@ -1,13 +1,12 @@
 import pandas as pd
-from dask import dataframe as dd
 from sklearn.model_selection import ShuffleSplit, StratifiedShuffleSplit


 def load_data(path, index, label, n_rows=None, drop=None, verbose=True, **kwargs):
     """Load features and labels from file(s).

     Args:
-        path (str) : path to file(s)
+        path (str) : path to file or a http/ftp/s3 URL
         index (str) : column for index
         label (str) : column for labels
         n_rows (int) : number of rows to return

@@ -17,22 +16,12 @@ def load_data(path, index, label, n_rows=None, drop=None, verbose=True, **kwargs
     Returns:
         pd.DataFrame, pd.Series : features and labels
     """
-    if '*' in path:
-        feature_matrix = dd.read_csv(path, **kwargs).set_index(index, sorted=True)
-
-        labels = [label] + (drop or [])
-        y = feature_matrix[label].compute()
-        X = feature_matrix.drop(labels=labels, axis=1).compute()
+    feature_matrix = pd.read_csv(path, index_col=index, nrows=n_rows, **kwargs)

-        if n_rows:
-            X = X.head(n_rows)
-            y = y.head(n_rows)
-    else:
-        feature_matrix = pd.read_csv(path, index_col=index, nrows=n_rows, **kwargs)
-
-        labels = [label] + (drop or [])
-        y = feature_matrix[label]
-        X = feature_matrix.drop(columns=labels)

Comment: @jeremyliweishih nice, you beat me to it re #315! Just for clarity, are these changes needed to support pandas 1.0.0? Or is this just something you wanted to do? I'm on board either way, although technically if it were the latter, it should be in a separate PR.

Comment: If we support pandas 1.0.0, it introduces a FutureWarning when importing evalml, since Dask hasn't properly silenced it yet. I think it should be in this PR!

Comment: Ah ok, cool, then yeah I'm on board with deleting the dask dependency in requirements.txt too!

+    labels = [label] + (drop or [])
+    y = feature_matrix[label]
+    X = feature_matrix.drop(columns=labels)

     if verbose:
         # number of features
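For readers who want to see the result in one piece, the added and unchanged lines of the two hunks above can be assembled into a runnable sketch. This is a reading of the diff, not the merged file: the `verbose` branch is left out because the diff cuts off before its body, and the in-memory CSV with its column names is a made-up stand-in for a real file.

```python
import io

import pandas as pd


def load_data(path, index, label, n_rows=None, drop=None, **kwargs):
    """Sketch of the simplified loader; `verbose` handling is omitted."""
    # pandas reads a single path, URL, or file-like object; no glob expansion
    feature_matrix = pd.read_csv(path, index_col=index, nrows=n_rows, **kwargs)

    # the label column plus any extra columns are removed from the features
    labels = [label] + (drop or [])
    y = feature_matrix[label]
    X = feature_matrix.drop(columns=labels)
    return X, y


# Hypothetical data: column names and values are illustrative only
csv = io.StringIO(
    "id,amount,country,fraud\n"
    "1,10.0,US,0\n"
    "2,99.5,CA,1\n"
    "3,5.25,US,0\n"
)
X, y = load_data(csv, index="id", label="fraud", n_rows=2, drop=["country"])
```

Because `nrows` is passed straight to `pd.read_csv`, the row limit applies while reading, so the separate `head(n_rows)` step from the dask branch is no longer needed.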
@@ -1,8 +1,7 @@
 scipy>=1.2.1
 scikit-learn>=0.21.3,!=0.22
-dask[complete]>=2.1.0

Comment: 🎉

 numpy>=1.16.4
-pandas>=0.25.0,<1.0.0
+pandas>=0.25.0
 xgboost>=0.82
 tqdm>=4.33.0
 scikit-optimize[plots]
So, we used dask to support loading globs. I think it's fine to remove that, but we could update the docstring above for path.
Ah got it, I'll update the docstring!
The problem is that we no longer support multiple files. It has to be a single file.
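If a caller still needs the old glob behaviour, one possible workaround outside this PR is to expand the pattern with the standard library and concatenate the pieces with pandas. The sketch below assumes every matching file shares the same columns; `load_glob` and the `part-*.csv` file names are hypothetical and do not exist in evalml.

```python
import glob
import os
import tempfile

import pandas as pd


def load_glob(pattern, index, **kwargs):
    """Hypothetical helper: read every CSV matching `pattern` into one frame."""
    # sort so the row order does not depend on filesystem listing order
    paths = sorted(glob.glob(pattern))
    if not paths:
        raise FileNotFoundError(f"no files match {pattern!r}")
    frames = [pd.read_csv(p, index_col=index, **kwargs) for p in paths]
    return pd.concat(frames)


# Write two small illustrative files, then load them through the pattern
with tempfile.TemporaryDirectory() as tmp:
    for i, rows in enumerate(["1,10.0,0\n2,99.5,1\n", "3,5.25,0\n"]):
        with open(os.path.join(tmp, f"part-{i}.csv"), "w") as f:
            f.write("id,amount,fraud\n" + rows)
    combined = load_glob(os.path.join(tmp, "part-*.csv"), index="id")
```

Unlike the removed dask branch, this reads every file eagerly into memory, so it only suits data that fits in RAM.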