<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8" />
<meta http-equiv="X-UA-Compatible" content="IE=edge" />
<title>Train, Test, Validation</title>
<meta name="description" content="Train, Test, and Validation Sets" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<!-- <link rel="stylesheet" href="css/styles.scss" /> -->
<link rel="stylesheet" href="sass/main.scss" />
</head>
<body>
<main></main>
<!-- nav -->
<header>
<nav>
<span
class="cardSpan"
onclick="location.href='https://mla.corp.amazon.com/explain/';"
>
<span class="icon">
<object data="robot.svg" type="image/svg+xml"></object>
</span>
<span class="text"
><h2>MLU-EXPL<span id="ai">AI</span>N</h2></span
>
</span>
<ul id="toc">
<li>
<a data-page="intro" href="#intro">Introduction</a>
</li>
<li>
<a data-page="split" href="#split">The Split</a>
</li>
<li>
<a data-page="train" href="#train">Train Set</a>
</li>
<li>
<a data-page="model" href="#model">Model</a>
</li>
<li>
<a data-page="validation" href="#validation">Validation Set</a>
</li>
<!-- <li>
<a data-page="error" href="#error">Measuring Error</a>
</li> -->
<li>
<a data-page="test" href="#test">Test Set</a>
</li>
<li>
<a data-page="all" href="#all">Summary</a>
</li>
<!-- <li><a data-page="model" href="#model">model</a></li> -->
<div class="bubble"></div>
</ul>
</nav>
</header>
<!-- main content -->
<div id="scrolly">
<section data-index="0" class="intro-mobile" id="intro-mobile">
<h2>The Importance of Data Splitting <br /></h2>
<p>
By
<a href="https://phonetool.amazon.com/users/wilberjw">Jared Wilber</a>
&
<a href="https://phonetool.amazon.com/users/bwernes">Brent Werness</a
>.<br /><br /><br /><br />
In most supervised machine learning tasks, best practice is to
split your data into three independent sets: a
<span id="annotation1"><b>training set</b></span
>, a <span id="annotation2"><b>testing set</b></span
>, and a <span id="annotation3"><b>validation set</b></span
>. <br /><br />
To demo the reasons for splitting data in this manner, we will pretend
that we have a dataset made of pets of the following two types:
<br /><br />
<span class="center"
><span class="b">Cats</span>:
<img
width="1.5rem"
height="auto"
class="inline-svg"
src="./assets/noun_Cat_18061.svg" />
<span class="b">Dogs</span>:
<img
width="1.5rem"
height="auto"
class="inline-svg"
src="./assets/noun_Dog_79225.svg"
/></span>
<br /><br />
For each pet in the dataset we know only two features:
<span class="b">weight</span> and <span class="b">fluffiness</span>.
<br /><br />
Our goal is to use the different data splits to identify the
best model for classifying a given pet as either a cat or a dog, based
on the available features.
</p>
</section>
<figure>
<div id="main-wrapper">
<div class="button-container">
<p>Select feature:</p>
<button class="button" value="neither">None</button>
<button class="button active" value="weight">Weight</button>
<button class="button" value="fluffiness">Fluffiness</button>
<button class="button" value="both">Both</button>
</div>
<div id="chart-wrapper">
<div id="chart"></div>
<div id="table"></div>
</div>
</div>
<!-- @end #main-wrapper -->
</figure>
<article>
<section data-index="0" class="intro" id="intro">
<h2>The Importance of Data Splitting <br /></h2>
<p>
By
<a href="https://phonetool.amazon.com/users/wilberjw"
>Jared Wilber</a
>
&
<a href="https://phonetool.amazon.com/users/bwernes"
>Brent Werness</a
>.<br /><br /><br /><br />
In most supervised machine learning tasks, best practice is
to split your data into three independent sets: a
<span id="annotation1-intro"><b>training set</b></span
>, a <span id="annotation2-intro"><b>testing set</b></span
>, and a <span id="annotation3-intro"><b>validation set</b></span
>. <br /><br />
To see why, let's pretend that we have a dataset of two types of
pets: <br /><br />
<span class="center"
><span class="b">Cats</span>:
<img
width="1.5rem"
height="auto"
class="inline-svg"
src="./assets/noun_Cat_18061.svg" />
<span class="b">Dogs</span>:
<img
width="1.5rem"
height="auto"
class="inline-svg"
src="./assets/noun_Dog_79225.svg"
/></span>
<br /><br />
For each pet in our dataset, we have two features:
<span class="bold">weight</span> and
<span class="bold">fluffiness</span>. <br /><br />
Our goal is to identify and evaluate suitable models for classifying
a given pet as either a cat or a dog. We'll use
train/test/validation splits to do this!
</p>
</section>
<section data-index="1" class="split" id="split">
<h2>Train, Test, and Validation Splits</h2>
<p>
The first step in our classification task is to randomly
<!-- <span
class="tooltip-item"
>[*]<span class="tooltiptext">
The random assignment to each split is important to ensure that
each independent dataset is as representative and free from
confounding factors as possible
</span></span
> -->
split our pets into three independent sets:
<br /><br />
<span id="annotation4"><span class="bold">Training Set</span></span
>:
<span class="item-text"
>The dataset that we feed our model to learn potential underlying
patterns and relationships.</span
><br /><br />
<span id="annotation5"
><span class="bold">Validation Set</span></span
>:
<span class="item-text"
>The dataset that we use to understand our model's performance
across different model types and parameter choices.</span
><br /><br />
<span id="annotation6"><span class="bold">Test Set</span></span
>:
<span class="item-text"
>The dataset that we use to approximate our model's accuracy in
the wild.</span
>
</p>
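<p>
As a rough illustration of the random split described above, here is a
minimal sketch in plain JavaScript. It is not the code behind this page:
the <code>splitData</code> helper, the 70/15/15 fractions, and the
made-up <code>pets</code> records are all assumptions chosen for the
example.
</p>

```javascript
// Hypothetical helper: shuffle a copy of the data, then cut it into three parts.
// The 70/15/15 fractions are an assumption for illustration only.
function splitData(items, trainFrac = 0.7, valFrac = 0.15, random = Math.random) {
  const shuffled = [...items];
  // Fisher-Yates shuffle so that every ordering is equally likely.
  for (let i = shuffled.length - 1; i > 0; i--) {
    const j = Math.floor(random() * (i + 1));
    [shuffled[i], shuffled[j]] = [shuffled[j], shuffled[i]];
  }
  const trainEnd = Math.round(shuffled.length * trainFrac);
  const valEnd = trainEnd + Math.round(shuffled.length * valFrac);
  return {
    train: shuffled.slice(0, trainEnd),
    validation: shuffled.slice(trainEnd, valEnd),
    test: shuffled.slice(valEnd),
  };
}

// Made-up pets: 20 records carrying the two features used in this article.
const pets = Array.from({ length: 20 }, (_, i) => ({
  id: i,
  label: i % 2 === 0 ? "cat" : "dog",
  weight: 3 + i,
  fluffiness: (i * 7) % 10,
}));

const splits = splitData(pets);
console.log(splits.train.length, splits.validation.length, splits.test.length);
```

<p>
Shuffling before slicing is what makes the assignment random, and every
pet lands in exactly one of the three sets, so the sets never overlap.
</p>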
</section>
<section data-index="2" class="train" id="train">
<h2>The Training Set</h2>
<p>
<span id="annotation7"
>The training set is the dataset that we employ to train our
model.</span
>
Our model uses this dataset to learn the underlying patterns and
relationships that it will later rely on to make
predictions. <br /><br />
Since our model learns from it, it is very important that the
training set be as representative as possible
<!-- <span
class="tooltip-item"
>[*]<span class="tooltiptext">
<a href="https://arxiv.org/abs/2103.03399"
>This paper explores the importance of diverse and
representative training data for machine learning
performance</a
></span
></span
> -->
of the population that we are trying to model. <br /><br />
Additionally, we need to be careful and ensure that it is as
unbiased as possible, as any bias at this stage will be propagated
downstream
<!-- <span class="tooltip-item"
>[*]<span class="tooltiptext">
<a href="https://arxiv.org/abs/1901.10002"
>Downstream harms occur across many machine learning models</a
>. Examples abound, including such diverse areas as
<a
href="https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing"
>criminal sentencing</a
>
and
<a href="https://www.nature.com/articles/d41586-019-03228-6"
>health-care</a
>
</span></span
> -->
during inference. <br /><br />
To give our model as much information to learn from as possible, we
typically assign the majority (e.g. 60–80%) of our original
data to the training set.
</p>
</section>
<section data-index="3" class="model" id="model">
<h2>Building Our Model</h2>
<p>
Our goal is to determine whether a given pet is a cat or a dog. This
is a binary classification task, so we will use a simple but
effective model appropriate for this task:
<a href="https://www.youtube.com/watch?v=jhJwNpidiqM"
><span class="bold">logistic regression</span></a
>. <br /><br />
Given a particular combination of the available features (<span
class="bold"
>None</span
>, <span class="bold">Weight</span>,
<span class="bold">Fluffiness</span>, or <i>both</i>
<span class="bold">Weight</span> and
<span class="bold">Fluffiness</span>), the logistic regression
classifier will generate a decision boundary to partition the pets
into either cats or dogs. Each pet will be classified based on its
position relative to the decision boundary: one side for dogs, the
other for cats. <br /><br />
Select the feature set of your choice to visualize the corresponding
logistic regression model's decision boundary. <br /><br /><b
>Drag each animal in the training set to a new position to see how
the boundary updates!</b
>
</p>
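<p>
To make the idea of a learned decision boundary concrete, here is a
small sketch of logistic regression fit with batch gradient descent. It
is an assumption-laden toy, not the implementation used by the
interactive chart: the <code>fitLogistic</code> and
<code>predict</code> names, the learning rate, the step count, the 0/1
label encoding, and the six training pets are all invented for
illustration.
</p>

```javascript
// Logistic (sigmoid) function: maps any real score to a probability in (0, 1).
const sigmoid = (z) => 1 / (1 + Math.exp(-z));

// Fit weights and a bias with plain batch gradient descent on the log loss.
// rows: array of feature arrays; labels: 1 for dog, 0 for cat (an assumed encoding).
function fitLogistic(rows, labels, learningRate = 0.1, steps = 2000) {
  let weights = rows[0].map(() => 0);
  let bias = 0;
  for (let step = steps; step > 0; step--) {
    const gradW = weights.map(() => 0);
    let gradB = 0;
    rows.forEach((x, n) => {
      const score = x.reduce((sum, xi, k) => sum + xi * weights[k], bias);
      const error = sigmoid(score) - labels[n]; // derivative of the log loss w.r.t. the score
      x.forEach((xi, k) => { gradW[k] += error * xi; });
      gradB += error;
    });
    weights = weights.map((w, k) => w - (learningRate * gradW[k]) / rows.length);
    bias -= (learningRate * gradB) / rows.length;
  }
  return { weights, bias };
}

// A pet falls on the "dog" side of the boundary when its score is positive,
// which is the same as a predicted dog probability above 0.5.
const predict = (model, x) =>
  x.reduce((sum, xi, k) => sum + xi * model.weights[k], model.bias) > 0 ? "dog" : "cat";

// Made-up training pets: [weight, fluffiness], both scaled to roughly 0..1.
const trainX = [[0.2, 0.8], [0.3, 0.9], [0.25, 0.7], [0.8, 0.3], [0.9, 0.4], [0.7, 0.2]];
const trainY = [0, 0, 0, 1, 1, 1];

const model = fitLogistic(trainX, trainY);
console.log(predict(model, [0.85, 0.25]), predict(model, [0.2, 0.85]));
```

<p>
The decision boundary is the set of points where the score equals zero.
Changing any training pet changes the gradients, which is why the
boundary moves when the training data moves.
</p>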
</section>
<section data-index="4" class="validation" id="validation">
<h2>The Validation Set</h2>
<p>
For logistic regression, we can build four different classifiers
— one for each choice of features to be included in the model:
none<!-- <span class="tooltip-item"
>[*]<span class="tooltiptext"> i.e. majority vote</span></span
> -->, just weight, just fluffiness, or both weight and fluffiness.
<br /><br />
<b>How should we decide which model to select?</b> <br /><br />
We could compare the accuracy of each model on the training set, but
doing so gives an overly optimistic, biased estimate: if we use the
same dataset for both training and tuning, we will tend to pick a model
that overfits and does not generalize well beyond that dataset.<br /><br />
<b>This is where the validation set comes in</b> —
<span id="annotation8"
>it acts as an independent, unbiased dataset for comparing the
performance of different algorithm and hyperparameter choices
trained on our training set.</span
>
<br /><br />
In our case, we are only looking at one algorithm (logistic
regression), but we can treat the choice of features as a
hyperparameter that we would like to tune<span class="tooltip-item"
>[*]<span class="tooltiptext">
For expository purposes, we won't worry about other choices for
our logistic regression model, such as specific solvers or
regularization penalties</span
></span
>.<br /><br />
Select a feature to view the model's performance on the validation
set in the table below.
<br /><br />
<b
>Drag the feature across the line to see how the performance
updates!</b
>
</p>
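<p>
The selection step can be sketched as follows. This example assumes the
four candidate models have already been fit on the training set; to keep
it short they are represented by hand-written stand-in rules rather than
real fitted logistic regressions, and the six validation pets are made
up. None of the names below come from this page's own scripts.
</p>

```javascript
// Fraction of pets whose predicted label matches the true label.
function accuracy(predict, pets) {
  const correct = pets.filter((pet) => predict(pet) === pet.label).length;
  return correct / pets.length;
}

// Stand-ins for four models already fit on the TRAINING set (hypothetical rules,
// not real fitted logistic regressions), one per feature choice.
const candidates = {
  none: () => "dog", // no features: always predict a single class
  weight: (pet) => (pet.weight > 0.5 ? "dog" : "cat"),
  fluffiness: (pet) => (pet.fluffiness > 0.5 ? "cat" : "dog"),
  both: (pet) => (pet.weight - pet.fluffiness > 0 ? "dog" : "cat"),
};

// Made-up VALIDATION pets, never used for fitting.
const validation = [
  { label: "cat", weight: 0.2, fluffiness: 0.8 },
  { label: "cat", weight: 0.6, fluffiness: 0.9 },
  { label: "cat", weight: 0.3, fluffiness: 0.4 },
  { label: "dog", weight: 0.8, fluffiness: 0.3 },
  { label: "dog", weight: 0.7, fluffiness: 0.6 },
  { label: "dog", weight: 0.4, fluffiness: 0.2 },
];

// Score every candidate on the validation set and keep the most accurate one.
const scores = Object.entries(candidates).map(([name, predict]) => ({
  name,
  accuracy: accuracy(predict, validation),
}));
const best = scores.reduce((a, b) => (b.accuracy > a.accuracy ? b : a));
console.log(scores, "selected:", best.name);
```

<p>
Note that the test set does not appear anywhere in this step: the
comparison uses validation accuracy only.
</p>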
</section>
<section data-index="5" class="test" id="test">
<h2>The Testing Set</h2>
<p>
So, assuming you did not go too crazy moving the pets around, our
best model (according to the validation set) is the model that takes
both features into account. <br /><br />
Once we have used the validation set to determine the algorithm and
parameter choices that we would like to use in production, the test
set is used to approximate the model's true performance in the
wild.
<br /><br />Let's repeat that:
<span id="annotation9"
>the test set is used as the final step in evaluating our model's
performance on unseen data.</span
>
<br /><br />
<b
>We should never, under any circumstance, look at the test set's
performance before selecting a model.</b
>
<br /><br />
Model selection and parameter tuning belong strictly to the
validation set.
<br /><br />
Peeking at our test set performance ahead of time is a form of
overfitting, and will likely lead to unreliable performance
expectations in production. It should only be checked as the final
form of evaluation, after the validation set has been used to
identify the best model.
</p>
</section>
<section data-index="6" class="all" id="all">
<h2>Summary</h2>
<p>
You may have noticed that the test accuracy of the "just fluffiness"
model was higher than that of the "both features" model, even though
the validation set selected the latter as the best. Validation
performance will not always match test performance exactly, and that
is not a bad thing. Remember
that the test performance is not a number to optimize over;
it is a metric to assess future performance. It allows us to
estimate, with confidence, that our model can distinguish between
cats and dogs with 87.5% accuracy.
<br /><br />
<span class="bold">Key Takeaways</span>
<br /><br />
It is best practice in machine learning to split our data into the
following three groups:
<br /><br />
<span id="annotation10"
><span class="bold">Training Set</span>:
<span class="item-text">For training of the model.</span></span
>
<br /><br />
<span id="annotation11">
<span class="bold">Validation Set</span>:
<span class="item-text"
>For unbiased comparison of candidate models.</span
></span
><br /><br />
<span id="annotation12"
><span class="bold">Test Set</span>:
<span class="item-text"
>For final evaluation of the model.</span
></span
>
<br /><br />
Adhering to this setup helps ensure that we have a realistic
understanding of our model's performance, and that we (hopefully)
built a model that generalizes well to unseen data.
<br /><br />
<span class="center">🐾</span>
<br /><br />
Thanks for reading! To learn more about machine learning, check out
<a href="https://aws.amazon.com/machine-learning/mlu/"
>our website</a
>, watch
<a href="https://www.youtube.com/channel/UC12LqyqTQYbXatYS9AA7Nuw"
>our videos</a
>, or read the <a href="https://d2l.ai/">D2L book</a>. <br /><br />
</p>
</section>
<p id="end-p"></p>
</article>
</div>
<script src="./js/index.js"></script>
<script src="./js/roughAnnotations.js"></script>
</body>
</html>