Multiclass pipeline #21

Merged: 77 commits, merged on Sep 10, 2019
The diff below shows changes from 10 of the 77 commits.

Commits (77)
2c7da32
Added multiclass functionality for logr and xgboost
jeremyliweishih Aug 22, 2019
7ef4a81
Merge remote-tracking branch 'origin' into multiclass_pipeline
jeremyliweishih Aug 22, 2019
6122985
Merge
jeremyliweishih Aug 22, 2019
f5ed9ad
Multi works with pipelines but not with auto
jeremyliweishih Aug 22, 2019
234bada
Added multiclass metrics and fixed tests
jeremyliweishih Aug 23, 2019
f016973
Added to ensure pipelines are distinct
jeremyliweishih Aug 23, 2019
a679e8c
Refactor xgboost
jeremyliweishih Aug 26, 2019
870665c
Added as an argument and updated get_objective
jeremyliweishih Aug 26, 2019
9e03614
Updated test_multi_auto and test_serialization
jeremyliweishih Aug 26, 2019
24f590a
lint
jeremyliweishih Aug 26, 2019
9d1ddf9
Separated tests
jeremyliweishih Aug 26, 2019
78ddd7a
Switched default to binary
jeremyliweishih Aug 26, 2019
ce25260
lint
jeremyliweishih Aug 26, 2019
125a2ab
imports
jeremyliweishih Aug 26, 2019
e0d7311
Fixed default objectives for multiclass
jeremyliweishih Aug 26, 2019
dfbea51
Merge remote-tracking branch 'origin' into multiclass_pipeline
jeremyliweishih Aug 26, 2019
d69cbdf
lint
jeremyliweishih Aug 26, 2019
8863e74
Check for attribute first
jeremyliweishih Aug 26, 2019
7443a1f
Made multi data pd
jeremyliweishih Aug 26, 2019
ef0ee97
Import
jeremyliweishih Aug 26, 2019
3145bf7
DataFrame
jeremyliweishih Aug 26, 2019
bce0ed2
remove pd
jeremyliweishih Aug 26, 2019
8793d59
Split into separate metrics
jeremyliweishih Aug 27, 2019
f845e9f
Moved binarize to _handle_predictions
jeremyliweishih Aug 27, 2019
79e6291
Added all cases to get_objective
jeremyliweishih Aug 27, 2019
52f52e3
Added utility for getting types of objectives
jeremyliweishih Aug 27, 2019
cd0cfb5
Fix tests
jeremyliweishih Aug 27, 2019
da0934e
lint
jeremyliweishih Aug 27, 2019
9174306
Added multiclass parameter and updated get_objective
jeremyliweishih Aug 27, 2019
281f7ba
Fixed ROC and LogLoss
jeremyliweishih Aug 27, 2019
95e5e45
Update tests
jeremyliweishih Aug 27, 2019
b1dab35
Fix binary prob
jeremyliweishih Aug 27, 2019
91e2621
Added multiclass importance for LR and removed warning for others
jeremyliweishih Aug 28, 2019
ccaa9ab
Added docstrings for utils
jeremyliweishih Sep 3, 2019
0f6c2fe
Removed abs for LR FI
jeremyliweishih Sep 3, 2019
60690cf
Removed comment
jeremyliweishih Sep 3, 2019
5601bc9
Added correct number_features
jeremyliweishih Sep 3, 2019
b2eea0f
Add another case of testing objective
jeremyliweishih Sep 3, 2019
dfde4fe
Added num_features
jeremyliweishih Sep 3, 2019
a37d533
Added check to default objectives and changed importing
jeremyliweishih Sep 3, 2019
b73d60e
Moved making pd dataframe into pipelinebase
jeremyliweishih Sep 3, 2019
e2bdf02
Added number of features for serialization
jeremyliweishih Sep 3, 2019
5fddcf1
Switched to objective_types
jeremyliweishih Sep 3, 2019
bed9355
Lint
jeremyliweishih Sep 3, 2019
42fc1a6
Update docstring
jeremyliweishih Sep 3, 2019
b33b049
Using enum
jeremyliweishih Sep 4, 2019
87130d5
Updated docstrings
jeremyliweishih Sep 4, 2019
56c28f3
Update enum, add handler, and tests
jeremyliweishih Sep 5, 2019
abdeb12
Added handle_problem_type to get_pipeline
jeremyliweishih Sep 5, 2019
11fd6bc
Replaced objective_type with problem_type
jeremyliweishih Sep 5, 2019
092255f
lint
jeremyliweishih Sep 6, 2019
76f6e09
Remove auto
jeremyliweishih Sep 6, 2019
21ae443
Merge branch 'master' into multiclass_pipeline
jeremyliweishih Sep 6, 2019
de3155c
merge issues
jeremyliweishih Sep 6, 2019
8579256
Changed all problem_types to list and adjusted accordingly
jeremyliweishih Sep 9, 2019
bad9ab6
Added supports_problem_types
jeremyliweishih Sep 9, 2019
28abf99
Remove docstring and removed else
jeremyliweishih Sep 10, 2019
da4ad65
Only wrap key error in try
jeremyliweishih Sep 10, 2019
236770a
Use enum
jeremyliweishih Sep 10, 2019
99cd7e4
Switched to problem_type
jeremyliweishih Sep 10, 2019
82cc6c9
Added docstring for handle_problem_types
jeremyliweishih Sep 10, 2019
7510e45
Switched to only singular problem_type
jeremyliweishih Sep 10, 2019
71b2120
Changed objective names and cleanup
jeremyliweishih Sep 10, 2019
92bb8a4
Clarified problem type tests
jeremyliweishih Sep 10, 2019
a9c8fff
Remove casting input as Pandas
jeremyliweishih Sep 10, 2019
e7a348c
Lint
jeremyliweishih Sep 10, 2019
ddf7842
Fix docstring
jeremyliweishih Sep 10, 2019
8a371e9
Fix problem_types docstring
jeremyliweishih Sep 10, 2019
00b49a7
Removed TODO, refer to #61
jeremyliweishih Sep 10, 2019
7c39ae8
Removed zip
jeremyliweishih Sep 10, 2019
9e99182
Moved default_objectives and get_objective to autobase
jeremyliweishih Sep 10, 2019
646b427
Changed to in docstrings
jeremyliweishih Sep 10, 2019
90ff848
Convert from str
jeremyliweishih Sep 10, 2019
f93f608
Merge branch 'master' into multiclass_pipeline
jeremyliweishih Sep 10, 2019
58ae5d3
Lint and cleanup
jeremyliweishih Sep 10, 2019
fbec76c
Merge branch 'multiclass_pipeline' of https://github.com/FeatureLabs/…
jeremyliweishih Sep 10, 2019
f53ebb1
lint and merge issue
jeremyliweishih Sep 10, 2019
4 changes: 2 additions & 2 deletions evalml/models/auto_base.py
@@ -14,7 +14,7 @@


class AutoBase:
-def __init__(self, problem_types, tuner, cv, objective, max_pipelines, max_time,
+def __init__(self, problem_type, tuner, cv, objective, max_pipelines, max_time,
model_types, default_objectives, detect_label_leakage, start_iteration_callback,
add_result_callback, random_state, verbose):
if tuner is None:
@@ -30,7 +30,7 @@ def __init__(self, problem_types, tuner, cv, objective, max_pipelines, max_time,
self.cv = cv
self.verbose = verbose

-self.possible_pipelines = get_pipelines(problem_types=problem_types, model_types=model_types)
+self.possible_pipelines = get_pipelines(problem_type=problem_type, model_types=model_types)

self.results = {}
self.trained_pipelines = {}
10 changes: 5 additions & 5 deletions evalml/models/auto_classifier.py
@@ -62,19 +62,19 @@ def __init__(self,
cv = StratifiedKFold(n_splits=3, random_state=random_state)

objective = get_objective(objective)
-default_objectives = get_objectives('binary')
+default_objectives = get_objectives(ProblemTypes.BINARY)
+problem_type = ProblemTypes.BINARY
if multiclass:
-default_objectives = get_objectives('multiclass')
-problem_types = [ProblemTypes.BINARY, ProblemTypes.MULTICLASS]
+default_objectives = get_objectives(ProblemTypes.MULTICLASS)
+problem_type = ProblemTypes.MULTICLASS
super().__init__(
tuner=tuner,
objective=objective,
cv=cv,
max_pipelines=max_pipelines,
max_time=max_time,
model_types=model_types,
-problem_types=problem_types,
+problem_type=problem_type,
default_objectives=default_objectives,
detect_label_leakage=detect_label_leakage,
start_iteration_callback=start_iteration_callback,
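
With this change, AutoClassifier's multiclass flag selects a single problem type instead of a list. A minimal usage sketch of the constructor shown above; the import path is assumed from this repo's file layout, and the wine dataset is only an illustrative multiclass input:

from sklearn.datasets import load_wine
from evalml.models.auto_classifier import AutoClassifier  # import path assumed from the file layout above

X, y = load_wine(return_X_y=True)  # three target classes

# Per the diff: multiclass=True sets problem_type = ProblemTypes.MULTICLASS and
# default_objectives = get_objectives(ProblemTypes.MULTICLASS); the default is binary.
clf = AutoClassifier(multiclass=True, max_pipelines=3)
clf.fit(X, y)
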
6 changes: 3 additions & 3 deletions evalml/models/auto_regressor.py
@@ -56,12 +56,12 @@ def __init__(self,
objective = "R2"

objective = get_objective(objective)
-default_objectives = get_objectives('regression')
+default_objectives = get_objectives(ProblemTypes.REGRESSION)

if cv is None:
cv = KFold(n_splits=3, random_state=random_state)

-problem_types = [ProblemTypes.REGRESSION]
+problem_type = ProblemTypes.REGRESSION

super().__init__(
tuner=tuner,
@@ -70,7 +70,7 @@
max_pipelines=max_pipelines,
max_time=max_time,
model_types=model_types,
-problem_types=problem_types,
+problem_type=problem_type,
default_objectives=default_objectives,
detect_label_leakage=detect_label_leakage,
start_iteration_callback=start_iteration_callback,
1 change: 0 additions & 1 deletion evalml/objectives/fraud_cost.py
@@ -35,7 +35,6 @@ def __init__(self, retry_percentage=.5, interchange_fee=.02,
def decision_function(self, y_predicted, extra_cols, threshold):
"""Determine if transaction is fraud given predicted probabilities,
dataframe with transaction amount, and threshold"""

transformed_probs = (y_predicted * extra_cols[self.amount_col])
return transformed_probs > threshold

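
For context, decision_function above multiplies each predicted fraud probability by the transaction amount and flags anything over the threshold. A standalone sketch of that logic; the column name, values, and threshold are illustrative, not taken from this PR:

import pandas as pd

y_predicted = pd.Series([0.01, 0.90, 0.40])                  # predicted fraud probabilities
extra_cols = pd.DataFrame({"amount": [100.0, 15.0, 500.0]})  # transaction amounts

# transformed_probs approximates the expected dollar loss per transaction
transformed_probs = y_predicted * extra_cols["amount"]
flagged = transformed_probs > 50.0  # threshold chosen for illustration
print(flagged.tolist())  # [False, False, True]
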
12 changes: 5 additions & 7 deletions evalml/objectives/objective_base.py
@@ -14,13 +14,11 @@ class ObjectiveBase:
def __init__(self, verbose=False):
self.verbose = verbose

-def supports_problem_types(self, problem_types):
-problem_types = handle_problem_types(problem_types)
-for problem_type in problem_types:
-if problem_type in self.__class__.problem_types:
-return True
-else:
-return False
+def supports_problem_type(self, problem_type):
+problem_type = handle_problem_types(problem_type)
+if problem_type in self.__class__.problem_types:
+return True
+return False

def fit(self, y_predicted, y_true, extra_cols=None):
"""Learn the objective function based on the predictions from a model.
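
The rewrite collapses the old for/else over a list into a membership check against a single normalized value. It could be tightened to a one-liner; a sketch, not the code in this PR:

def supports_problem_type(self, problem_type):
    """Return True if this objective declares support for the given problem type."""
    return handle_problem_types(problem_type) in self.__class__.problem_types
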
26 changes: 10 additions & 16 deletions evalml/objectives/standard_metrics.py
@@ -23,7 +23,7 @@ class F1Micro(ObjectiveBase):
needs_fitting = False
greater_is_better = True
need_proba = False
name = "F1_Micro"
name = "F1 Micro"
problem_types = [ProblemTypes.MULTICLASS]

def score(self, y_predicted, y_true):
@@ -34,7 +34,7 @@ class F1Macro(ObjectiveBase):
needs_fitting = False
greater_is_better = True
need_proba = False
name = "F1_Macro"
name = "F1 Macro"
problem_types = [ProblemTypes.MULTICLASS]

def score(self, y_predicted, y_true):
@@ -45,12 +45,9 @@ class F1Weighted(ObjectiveBase):
needs_fitting = False
greater_is_better = True
need_proba = False
name = "F1_Weighted"
name = "F1 Weighted"
problem_types = [ProblemTypes.MULTICLASS]

-def __init__(self, average='binary'):
-self.average = average

def score(self, y_predicted, y_true):
return metrics.f1_score(y_true, y_predicted, average='weighted')

@@ -70,12 +67,9 @@ class PrecisionMicro(ObjectiveBase):
needs_fitting = False
greater_is_better = True
need_proba = False
name = "Precision_Micro"
name = "Precision Micro"
problem_types = [ProblemTypes.MULTICLASS]

-def __init__(self, average='binary'):
-self.average = average

def score(self, y_predicted, y_true):
return metrics.precision_score(y_true, y_predicted, average='micro')

@@ -84,7 +78,7 @@ class PrecisionMacro(ObjectiveBase):
needs_fitting = False
greater_is_better = True
need_proba = False
name = "Precision_Macro"
name = "Precision Macro"
problem_types = [ProblemTypes.MULTICLASS]

def score(self, y_predicted, y_true):
@@ -95,7 +89,7 @@ class PrecisionWeighted(ObjectiveBase):
needs_fitting = False
greater_is_better = True
need_proba = False
name = "Precision_Weighted"
name = "Precision Weighted"
problem_types = [ProblemTypes.MULTICLASS]

def score(self, y_predicted, y_true):
@@ -117,7 +111,7 @@ class RecallMicro(ObjectiveBase):
needs_fitting = False
greater_is_better = True
need_proba = False
name = "Recall_Micro"
name = "Recall Micro"
problem_types = [ProblemTypes.MULTICLASS]

def score(self, y_predicted, y_true):
@@ -128,7 +122,7 @@ class RecallMacro(ObjectiveBase):
needs_fitting = False
greater_is_better = True
need_proba = False
name = "Recall_Macro"
name = "Recall Macro"
problem_types = [ProblemTypes.MULTICLASS]

def score(self, y_predicted, y_true):
@@ -161,7 +155,7 @@ class AUCMicro(ObjectiveBase):
needs_fitting = False
greater_is_better = True
score_needs_proba = True
name = "AUC_Micro"
name = "AUC Micro"
problem_types = [ProblemTypes.MULTICLASS]

def score(self, y_predicted, y_true):
@@ -173,7 +167,7 @@ class AUCMacro(ObjectiveBase):
needs_fitting = False
greater_is_better = True
score_needs_proba = True
name = "AUC_Macro"
name = "AUC Macro"
problem_types = [ProblemTypes.MULTICLASS]

def score(self, y_predicted, y_true):
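
The edits above rename only the human-readable name attributes (underscores become spaces) and drop unused __init__ overrides; scoring behavior is unchanged. A quick sketch of calling one of the multiclass metrics directly, with the module path assumed from the diff:

from evalml.objectives.standard_metrics import F1Micro  # module path assumed from the diff

objective = F1Micro()
y_true = [0, 1, 2, 2, 1]
y_predicted = [0, 2, 2, 2, 1]
print(objective.name)                        # "F1 Micro"
print(objective.score(y_predicted, y_true))  # 0.8; micro-averaged F1 equals accuracy here
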
6 changes: 3 additions & 3 deletions evalml/objectives/utils.py
@@ -41,7 +41,7 @@ def get_objective(objective):
return OPTIONS[objective]


-def get_objectives(problem_types):
+def get_objectives(problem_type):
"""Returns all objectives associated with the given problem types

Args:
@@ -50,5 +50,5 @@ def get_objectives(problem_types):
Returns:
List of Objectives
"""
-problem_types = handle_problem_types(problem_types)
-return [obj for obj in OPTIONS if OPTIONS[obj].supports_problem_types(problem_types)]
+problem_type = handle_problem_types(problem_type)
+return [obj for obj in OPTIONS if OPTIONS[obj].supports_problem_type(problem_type)]
3 changes: 0 additions & 3 deletions evalml/pipelines/classification/xgboost.py
@@ -57,9 +57,6 @@ def fit(self, X, y, objective_fit_size=.2):

y (pd.Series): the target training labels of length [n_samples]

-feature_types (list, optional): list of feature types. either numeric of categorical.
-categorical features will automatically be encoded

Returns:

self
13 changes: 5 additions & 8 deletions evalml/pipelines/utils.py
@@ -12,12 +12,12 @@
ALL_PIPELINES = [RFClassificationPipeline, XGBoostPipeline, LogisticRegressionPipeline, RFRegressionPipeline]


-def get_pipelines(problem_types, model_types=None):
+def get_pipelines(problem_type, model_types=None):
"""Returns potential pipelines by model type

Arguments:

-problem_types(ProblemTypes/str or list[ProblemTypes/str]): the problem type/s the pipelines work for.
+problem_type(ProblemTypes/str): the problem type the pipelines work for.
model_types(list[str]): model types to match. if none, return all pipelines

Returns
@@ -27,14 +27,11 @@ def get_pipelines(problem_types, model_types=None):
"""

problem_pipelines = []
-if not isinstance(problem_types, list):
-problem_types = list(problem_types)
-problem_types = handle_problem_types(problem_types)
+problem_type = handle_problem_types(problem_type)
for p in ALL_PIPELINES:
-for problem_type in problem_types:
-if problem_type in p.problem_types and p not in problem_pipelines:
-problem_pipelines.append(p)
+if problem_type in p.problem_types:
+problem_pipelines.append(p)

if model_types is None:
return problem_pipelines
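
Given ALL_PIPELINES above (three classification pipelines plus RFRegressionPipeline), the simplified filter behaves exactly as the updated tests expect. A usage sketch, with import paths assumed:

from evalml.pipelines.utils import get_pipelines
from evalml.problem_types import ProblemTypes

binary = get_pipelines(problem_type=ProblemTypes.BINARY)
assert len(binary) == 3  # RFClassificationPipeline, XGBoostPipeline, LogisticRegressionPipeline

# Strings work too, since handle_problem_types normalizes them
regression = get_pipelines(problem_type="regression")
assert len(regression) == 1  # RFRegressionPipeline
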
34 changes: 19 additions & 15 deletions evalml/problem_types/utils.py
@@ -1,18 +1,22 @@
from .problem_types import ProblemTypes


-def handle_problem_types(problem_types):
-if isinstance(problem_types, ProblemTypes):
-return problem_types
-if isinstance(problem_types, str):
-problem_types = [problem_types]
-types = list()
-for problem_type in problem_types:
-if isinstance(problem_type, ProblemTypes):
-types.append(problem_type)
-elif isinstance(problem_type, str):
-try:
-types.append(ProblemTypes[problem_type.upper()])
-except KeyError:
-raise KeyError('Problem type \'{}\' does not exist'.format(problem_type))
-return types
+def handle_problem_types(problem_type):
+"""Handles problem_type by either returning the ProblemTypes or converting from a str
+
+Args:
+problem_type (str/ProblemTypes) : problem type that needs to be handled
+
+Returns:
+ProblemTypes
+"""
+
+if isinstance(problem_type, str):
+try:
+tpe = ProblemTypes[problem_type.upper()]
+except KeyError:
+raise KeyError('Problem type \'{}\' does not exist'.format(problem_type))
+return tpe
+if isinstance(problem_type, ProblemTypes):
+return problem_type
+raise ValueError('`handle_problem_types` was not passed a str or ProblemTypes object')
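
The rewritten helper now normalizes exactly one value: a string is upper-cased and looked up in the enum, a ProblemTypes member passes through, and anything else raises. Its contract in miniature, with import paths assumed from the file layout:

from evalml.problem_types import ProblemTypes
from evalml.problem_types.utils import handle_problem_types

assert handle_problem_types("binary") is ProblemTypes.BINARY             # str: upper-cased enum lookup
assert handle_problem_types(ProblemTypes.BINARY) is ProblemTypes.BINARY  # enum passes through

# Invalid inputs raise, per the implementation above:
# handle_problem_types("fake") -> KeyError: "Problem type 'fake' does not exist"
# handle_problem_types(5)      -> ValueError: not passed a str or ProblemTypes object
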
4 changes: 2 additions & 2 deletions evalml/tests/test_autoclassifier.py
@@ -19,7 +19,7 @@ def test_init(X_y):
clf = AutoClassifier(multiclass=False)

# check loads all pipelines
-assert get_pipelines(problem_types=[ProblemTypes.BINARY]) == clf.possible_pipelines
+assert get_pipelines(problem_type=ProblemTypes.BINARY) == clf.possible_pipelines

clf.fit(X, y)

@@ -65,7 +65,7 @@ def test_init_select_model_types():
model_types = ["random_forest"]
clf = AutoClassifier(model_types=model_types)

-assert get_pipelines(problem_types=[ProblemTypes.BINARY], model_types=model_types) == clf.possible_pipelines
+assert get_pipelines(problem_type=ProblemTypes.BINARY, model_types=model_types) == clf.possible_pipelines
assert model_types == clf.possible_model_types


2 changes: 1 addition & 1 deletion evalml/tests/test_autoregressor.py
@@ -18,7 +18,7 @@ def test_init(X_y):
clf = AutoRegressor(objective="R2", max_pipelines=3)

# check loads all pipelines
-assert get_pipelines(problem_types=[ProblemTypes.REGRESSION]) == clf.possible_pipelines
+assert get_pipelines(problem_type=ProblemTypes.REGRESSION) == clf.possible_pipelines

clf.fit(X, y)

11 changes: 4 additions & 7 deletions evalml/tests/test_pipelines.py
@@ -21,9 +21,9 @@ def test_list_model_types():


def test_get_pipelines():
-assert len(get_pipelines(problem_types=[ProblemTypes.BINARY])) == 3
-assert len(get_pipelines(problem_types=[ProblemTypes.BINARY], model_types=["linear_model"])) == 1
-assert len(get_pipelines(problem_types=[ProblemTypes.REGRESSION])) == 1
+assert len(get_pipelines(problem_type=ProblemTypes.BINARY)) == 3
+assert len(get_pipelines(problem_type=ProblemTypes.BINARY, model_types=["linear_model"])) == 1
+assert len(get_pipelines(problem_type=ProblemTypes.REGRESSION)) == 1


@pytest.fixture
@@ -40,8 +40,6 @@ def path_management():

def test_serialization(X_y, trained_model, path_management):
X, y = X_y
-X = pd.DataFrame(X)
-y = pd.Series(y)
path = os.path.join(path_management, 'pipe.pkl')
objective = Precision()

@@ -53,8 +51,7 @@ def test_serialization(X_y, trained_model, path_management):

def test_reproducibility(X_y):
X, y = X_y
-X = pd.DataFrame(X)
-y = pd.Series(y)
+X = pd.DataFrame(X)  # TODO: FraudCost.decision_function breaks when given np.array(). Need to standardize input as pd or adjust function.

objective = FraudCost(
retry_percentage=.5,
33 changes: 17 additions & 16 deletions evalml/tests/test_problem_types.py
@@ -4,27 +4,28 @@


@pytest.fixture
-def correct_pts():
-correct_pts = [[ProblemTypes.REGRESSION], [ProblemTypes.MULTICLASS], [ProblemTypes.BINARY], [ProblemTypes.MULTICLASS, ProblemTypes.BINARY]]
-yield correct_pts
+def correct_problem_types():
+correct_problem_types = [ProblemTypes.REGRESSION, ProblemTypes.MULTICLASS, ProblemTypes.BINARY]
+yield correct_problem_types


-def test_handle_string(correct_pts):
-pts = [['regression'], ['multiclass'], ['binary'], ['multiclass', 'binary']]
-for pt in zip(pts, correct_pts):
-assert handle_problem_types(pt[0]) == pt[1]
+def test_handle_string(correct_problem_types):
+problem_types = ['regression', 'multiclass', 'binary']
+for problem_type in zip(problem_types, correct_problem_types):
+assert handle_problem_types(problem_type[0]) == problem_type[1]

-pts = ['fake', 'regression']
+problem_type = 'fake'
error_msg = 'Problem type \'fake\' does not exist'
with pytest.raises(KeyError, match=error_msg):
-handle_problem_types(pts) == ProblemTypes.regression
+handle_problem_types(problem_type) == ProblemTypes.REGRESSION


-def test_handle_problemtypes(correct_pts):
-for pt in zip(correct_pts, correct_pts):
-assert handle_problem_types(pt[0]) == pt[1]
+def test_handle_problem_types(correct_problem_types):
+for problem_type in zip(correct_problem_types, correct_problem_types):
+assert handle_problem_types(problem_type[0]) == problem_type[1]

-pts = ['fake', 'regression']
-error_msg = 'Problem type \'fake\' does not exist'
-with pytest.raises(KeyError, match=error_msg):
-handle_problem_types(pts) == ProblemTypes.regression

+def test_handle_incorrect_type():
+error_msg = '`handle_problem_types` was not passed a str or ProblemTypes object'
+with pytest.raises(ValueError, match=error_msg):
+handle_problem_types(5)