# ----------------------------------------------------------------------------
# Copyright (C) 2021-2023 Deepchecks (https://www.deepchecks.com)
#
# This file is part of Deepchecks.
# Deepchecks is distributed under the terms of the GNU Affero General
# Public License (version 3 or later).
# You should have received a copy of the GNU Affero General Public License
# along with Deepchecks. If not, see <http://www.gnu.org/licenses/>.
# ----------------------------------------------------------------------------
#
"""The data set contains features for binary prediction of breast cancer.

The data has 569 patient records with 30 features and one binary target column, referring to the presence of
breast cancer in the patient.

This is a copy of the UCI ML Breast Cancer Wisconsin (Diagnostic) dataset. https://goo.gl/U2Uwz2

Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe
characteristics of the cell nuclei present in the image.

In the original work, a separating plane was obtained using Multisurface Method-Tree (MSM-T) [K. P. Bennett,
“Decision Tree Construction Via Linear Programming.” Proceedings of the 4th Midwest Artificial Intelligence and
Cognitive Science Society, pp. 97-101, 1992], a classification method which uses linear programming to construct
a decision tree. Relevant features were selected using an exhaustive search in the space of 1-4 features and 1-3
separating planes.

The actual linear program used to obtain the separating plane in the 3-dimensional space is that described in:
[K. P. Bennett and O. L. Mangasarian: “Robust Linear Programming Discrimination of Two Linearly Inseparable Sets”,
Optimization Methods and Software 1, 1992, 23-34].

This database is also available through the UW CS ftp server:
ftp ftp.cs.wisc.edu cd math-prog/cpo-dataset/machine-learn/WDBC/

References:

* W.N. Street, W.H. Wolberg and O.L. Mangasarian. Nuclear feature extraction for breast tumor
  diagnosis. IS&T/SPIE 1993 International Symposium on Electronic Imaging: Science and Technology, volume 1905,
  pages 861-870, San Jose, CA, 1993.
* O.L. Mangasarian, W.N. Street and W.H. Wolberg. Breast cancer diagnosis and
  prognosis via linear programming. Operations Research, 43(4), pages 570-577, July-August 1995.
* W.H. Wolberg, W.N. Street, and O.L. Mangasarian. Machine learning techniques to diagnose breast cancer from
  fine-needle aspirates. Cancer Letters 77 (1994) 163-171.

The typical ML task in this dataset is to build a model that classifies between benign and malignant samples.

Ten real-valued features are computed for each cell nucleus:

#. radius (mean of distances from center to points on the perimeter)
#. texture (standard deviation of gray-scale values)
#. perimeter
#. area
#. smoothness (local variation in radius lengths)
#. compactness (perimeter^2 / area - 1.0)
#. concavity (severity of concave portions of the contour)
#. concave points (number of concave portions of the contour)
#. symmetry
#. fractal dimension ("coastline approximation" - 1)

The mean, standard error, and "worst" (mean of the three largest values) of each of these features are computed
per image, which gives the 30 features listed below.

Dataset Shape:
    .. list-table:: Dataset Shape
       :widths: 50 50
       :header-rows: 1

       * - Property
         - Value
       * - Samples Total
         - 569
       * - Dimensionality
         - 30
       * - Features
         - real
       * - Targets
         - boolean

Description:
    .. list-table:: Dataset Description
       :widths: 50 50 50
       :header-rows: 1

       * - Column name
         - Column Role
         - Description
       * - mean radius
         - Feature
         - mean radius
       * - mean texture
         - Feature
         - mean texture
       * - mean perimeter
         - Feature
         - mean perimeter
       * - mean area
         - Feature
         - mean area
       * - mean smoothness
         - Feature
         - mean smoothness
       * - mean compactness
         - Feature
         - mean compactness
       * - mean concavity
         - Feature
         - mean concavity
       * - mean concave points
         - Feature
         - mean concave points
       * - mean symmetry
         - Feature
         - mean symmetry
       * - mean fractal dimension
         - Feature
         - mean fractal dimension
       * - radius error
         - Feature
         - radius error
       * - texture error
         - Feature
         - texture error
       * - perimeter error
         - Feature
         - perimeter error
       * - area error
         - Feature
         - area error
       * - smoothness error
         - Feature
         - smoothness error
       * - compactness error
         - Feature
         - compactness error
       * - concavity error
         - Feature
         - concavity error
       * - concave points error
         - Feature
         - concave points error
       * - symmetry error
         - Feature
         - symmetry error
       * - fractal dimension error
         - Feature
         - fractal dimension error
       * - worst radius
         - Feature
         - worst radius
       * - worst texture
         - Feature
         - worst texture
       * - worst perimeter
         - Feature
         - worst perimeter
       * - worst area
         - Feature
         - worst area
       * - worst smoothness
         - Feature
         - worst smoothness
       * - worst compactness
         - Feature
         - worst compactness
       * - worst concavity
         - Feature
         - worst concavity
       * - worst concave points
         - Feature
         - worst concave points
       * - worst symmetry
         - Feature
         - worst symmetry
       * - worst fractal dimension
         - Feature
         - worst fractal dimension
       * - target
         - Label
         - The class (Benign, Malignant)
"""
import typing as t
from urllib.request import urlopen

import joblib
import pandas as pd
import sklearn
from sklearn.ensemble import AdaBoostClassifier

from deepchecks.tabular.dataset import Dataset

__all__ = ['load_data', 'load_fitted_model']

# Remote locations of the pretrained model and of the full, train and test data.
_MODEL_URL = 'https://figshare.com/ndownloader/files/35122759'
_FULL_DATA_URL = 'https://ndownloader.figshare.com/files/33325472'
_TRAIN_DATA_URL = 'https://ndownloader.figshare.com/files/33325556'
_TEST_DATA_URL = 'https://ndownloader.figshare.com/files/33325559'
# The pretrained model is only loaded when the installed scikit-learn version equals this one.
_MODEL_VERSION = '1.0.2'
_target = 'target'
# The dataset has no categorical features.
_CAT_FEATURES = []


def load_data(data_format: str = 'Dataset', as_train_test: bool = True) -> \
        t.Union[t.Tuple, t.Union[Dataset, pd.DataFrame]]:
    """Load and return the Breast Cancer dataset (classification).

    Parameters
    ----------
    data_format : str, default: 'Dataset'
        The format of the returned value. Can be 'Dataset'|'Dataframe'
        'Dataset' will return the data as a Dataset object
        'Dataframe' will return the data as a pandas DataFrame object
    as_train_test : bool, default: True
        If True, the returned data is split into train and test sets exactly as it was when the toy model
        was trained. The first return value is the train data and the second is the test data.
        To get this model, call the load_fitted_model() function.
        Otherwise, returns a single object.

    Returns
    -------
    dataset : Union[deepchecks.Dataset, pd.DataFrame]
        the data object, corresponding to the data_format argument.
    train, test : Tuple[Union[deepchecks.Dataset, pd.DataFrame], Union[deepchecks.Dataset, pd.DataFrame]]
        returned if as_train_test = True. Tuple of two objects holding the train and test sets.
    """
    if not as_train_test:
        dataset = pd.read_csv(_FULL_DATA_URL)

        if data_format == 'Dataset':
            dataset = Dataset(dataset, label=_target, cat_features=_CAT_FEATURES)
            return dataset
        elif data_format == 'Dataframe':
            return dataset
        else:
            raise ValueError('data_format must be either "Dataset" or "Dataframe"')
    else:
        train = pd.read_csv(_TRAIN_DATA_URL)
        test = pd.read_csv(_TEST_DATA_URL)

        if data_format == 'Dataset':
            train = Dataset(train, label=_target, cat_features=_CAT_FEATURES)
            test = Dataset(test, label=_target, cat_features=_CAT_FEATURES)
            return train, test
        elif data_format == 'Dataframe':
            return train, test
        else:
            raise ValueError('data_format must be either "Dataset" or "Dataframe"')
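

def load_fitted_model(pretrained=True):
    """Load and return a fitted classification model that predicts the target of the breast cancer dataset.

    Parameters
    ----------
    pretrained : bool, default: True
        If True and the installed scikit-learn version equals the version required by the stored model,
        the pretrained model is downloaded. Otherwise a new model is fitted on the train data.

    Returns
    -------
    model : Joblib
        The model/pipeline that was trained on the breast cancer dataset.
    """
    if sklearn.__version__ == _MODEL_VERSION and pretrained:
        # Download the stored model and deserialize it with joblib.
        with urlopen(_MODEL_URL) as f:
            model = joblib.load(f)
    else:
        # Fall back to fitting a fresh model on the train split.
        model = _build_model()
        train, _ = load_data()
        model.fit(train.data[train.features], train.data[train.label_name])
    return model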


def _build_model():
    """Build the model to fit."""
    return AdaBoostClassifier(random_state=0)