<div id="ipython-notebook">
<a class="interact-button" href="http://data8.berkeley.edu/hub/interact?repo=textbook&path=notebooks/little_women.csv&path=notebooks/faithful.csv&path=notebooks/sat2014.csv&path=notebooks/Error.ipynb">Interact</a>
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
tex2jax: {
inlineMath: [['$','$']],
processEscapes: true
}
});
</script>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h3 id="Error-in-the-regression-estimate">Error in the regression estimate<a class="anchor-link" href="#Error-in-the-regression-estimate">¶</a></h3><p>Though the average of the residuals is 0, the individual residuals typically are not; some might be quite far from 0. To get a sense of the amount of error in the regression estimate, we will start with a graphical description of the sense in which the regression line is the "best".</p>
<p>Our example is a dataset that has one point for every chapter of the novel "Little Women." The goal is to estimate the number of characters (that is, letters, punctuation marks, and so on) based on the number of periods. Recall that we attempted to do this in the very first lecture of this course.</p></div></div>
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">little_women</span> <span class="o">=</span> <span class="n">Table</span><span class="o">.</span><span class="n">read_table</span><span class="p">(</span><span class="s1">'little_women.csv'</span><span class="p">)</span>
<span class="n">little_women</span><span class="o">.</span><span class="n">show</span><span class="p">(</span><span class="mi">3</span><span class="p">)</span>
</pre></div></div></div>
<div class="output_html rendered_html output_subarea ">
<table border="1" class="dataframe">
<thead>
<tr>
<th>Characters</th> <th>Periods</th>
</tr>
</thead>
<tbody>
<tr>
<td>21759 </td> <td>189 </td>
</tr>
</tbody>
<tbody><tr>
<td>22148 </td> <td>188 </td>
</tr>
</tbody>
<tbody><tr>
<td>20558 </td> <td>231 </td>
</tr>
</tbody>
</table>
<p>... (44 rows omitted)</p></div>
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="c1"># One point for each chapter</span>
<span class="c1"># Horizontal axis: number of periods</span>
<span class="c1"># Vertical axis: number of characters (as in a, b, ", ?, etc; not people in the book)</span>
<span class="n">little_women</span><span class="o">.</span><span class="n">scatter</span><span class="p">(</span><span class="s1">'Periods'</span><span class="p">,</span> <span class="s1">'Characters'</span><span class="p">)</span>
</pre></div></div></div>
<div class="output_png output_subarea ">
<img src="../notebooks-images/Error_4_0.png"/></div>
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">correlation</span><span class="p">(</span><span class="n">little_women</span><span class="p">,</span> <span class="s1">'Periods'</span><span class="p">,</span> <span class="s1">'Characters'</span><span class="p">)</span>
</pre></div></div></div>
<div class="output_text output_subarea output_execute_result">
<pre>0.92295768958548163</pre></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>The scatter plot is remarkably close to linear, and the correlation is more than 0.92.</p></div></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>The figure below shows the scatter plot and regression line, with four of the errors marked in red.</p></div></div>
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="c1"># Residuals: Deviations from the regression line</span>
<span class="n">lw_slope</span> <span class="o">=</span> <span class="n">slope</span><span class="p">(</span><span class="n">little_women</span><span class="p">,</span> <span class="s1">'Periods'</span><span class="p">,</span> <span class="s1">'Characters'</span><span class="p">)</span>
<span class="n">lw_intercept</span> <span class="o">=</span> <span class="n">intercept</span><span class="p">(</span><span class="n">little_women</span><span class="p">,</span> <span class="s1">'Periods'</span><span class="p">,</span> <span class="s1">'Characters'</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">'Slope: '</span><span class="p">,</span> <span class="n">np</span><span class="o">.</span><span class="n">round</span><span class="p">(</span><span class="n">lw_slope</span><span class="p">))</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">'Intercept:'</span><span class="p">,</span> <span class="n">np</span><span class="o">.</span><span class="n">round</span><span class="p">(</span><span class="n">lw_intercept</span><span class="p">))</span>
<span class="n">lw_errors</span><span class="p">(</span><span class="n">lw_slope</span><span class="p">,</span> <span class="n">lw_intercept</span><span class="p">)</span>
</pre></div></div></div>
<div class="output_subarea output_stream output_stdout output_text">
<pre>Slope: 87.0
Intercept: 4745.0
</pre></div>
<div class="output_png output_subarea ">
<img src="../notebooks-images/Error_9_1.png"/></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Had we used a different line to create our estimates, the errors would have been different. The picture below shows how big the errors would be if we were to use a particularly silly line for estimation.</p></div></div>
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="c1"># Errors: Deviations from a different line</span>
<span class="n">lw_errors</span><span class="p">(</span><span class="o">-</span><span class="mi">100</span><span class="p">,</span> <span class="mi">50000</span><span class="p">)</span>
</pre></div></div></div>
<div class="output_png output_subarea ">
<img src="../notebooks-images/Error_11_0.png"/></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Below is a line that we have used before without saying that we were using a line to create estimates. It is the horizontal line at the value "average of $y$." Suppose you were asked to estimate $y$ and <em>were not told the value of $x$</em>; then you would use the average of $y$ as your estimate, regardless of the chapter. In other words, you would use the flat line below.</p>
<p>Each error that you would make would then be a deviation from average. The rough size of these deviations is the SD of $y$.</p>
<p>In summary, if we use the flat line at the average of $y$ to make our estimates, the estimates will be off by the SD of $y$.</p></div></div>
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="c1"># Errors: Deviations from the flat line at the average of y</span>
<span class="n">characters_average</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="n">little_women</span><span class="o">.</span><span class="n">column</span><span class="p">(</span><span class="s1">'Characters'</span><span class="p">))</span>
<span class="n">lw_errors</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">characters_average</span><span class="p">)</span>
</pre></div></div></div>
<div class="output_png output_subarea ">
<img src="../notebooks-images/Error_13_0.png"/></div>
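<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>We can check this claim numerically: the root mean squared error of the flat line at the average of $y$ is exactly the SD of $y$. Below is a minimal sketch using plain NumPy, with a small hypothetical array standing in for the <code>Characters</code> column.</p>
</div></div>

```python
import numpy as np

# Hypothetical stand-in for the 'Characters' column; any array works
y = np.array([21759, 22148, 20558, 25526, 23395], dtype=float)

# Errors of the flat line at the average of y
errors = y - np.mean(y)
rmse_flat = np.mean(errors ** 2) ** 0.5

# The RMSE of the flat line is exactly the SD of y
print(np.allclose(rmse_flat, np.std(y)))  # → True
```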
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h3 id="Least-Squares">Least Squares<a class="anchor-link" href="#Least-Squares">¶</a></h3><p>If you use any arbitrary line as your line of estimates, then some of your errors are likely to be positive and others negative. To avoid cancellation when measuring the rough size of the errors, we take the mean of the squared errors rather than the mean of the errors themselves. This is exactly analogous to our reason for looking at squared deviations from average, when we were learning how to calculate the SD.</p>
<p>The mean squared error of estimation using a straight line is a measure of roughly how big the squared errors are; taking the square root yields the root mean square error, which is in the same units as $y$.</p>
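<p>In symbols, writing the data as $(x_1, y_1), \ldots, (x_n, y_n)$ and the line as $ax + b$:</p>
<p>$$\text{mse}(a, b) = \frac{1}{n}\sum_{i=1}^{n}\big(y_i - (ax_i + b)\big)^2, \qquad \text{rmse}(a, b) = \sqrt{\text{mse}(a, b)}$$</p>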
<p>Here is a remarkable fact of mathematics: the regression line minimizes the mean squared error of estimation (and hence also the root mean squared error) among all straight lines. That is why the regression line is sometimes called the "least squares line."</p>
<p><strong>Computing the "best" line.</strong></p>
<ul>
<li>To get estimates of $y$ based on $x$, you can use any line you want.</li>
<li>Every line has a mean squared error of estimation.</li>
<li>"Better" lines have smaller errors.</li>
<li><strong>The regression line is the unique straight line that minimizes the mean squared error of estimation among all straight lines.</strong></li>
</ul>
<p>We can define a function to compute the root mean squared error of any line through the Little Women scatter diagram.</p></div></div>
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="k">def</span> <span class="nf">lw_rmse</span><span class="p">(</span><span class="n">slope</span><span class="p">,</span> <span class="n">intercept</span><span class="p">):</span>
<span class="n">lw_errors</span><span class="p">(</span><span class="n">slope</span><span class="p">,</span> <span class="n">intercept</span><span class="p">)</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">little_women</span><span class="o">.</span><span class="n">column</span><span class="p">(</span><span class="s1">'Periods'</span><span class="p">)</span>
<span class="n">y</span> <span class="o">=</span> <span class="n">little_women</span><span class="o">.</span><span class="n">column</span><span class="p">(</span><span class="s1">'Characters'</span><span class="p">)</span>
<span class="n">fitted</span> <span class="o">=</span> <span class="n">slope</span> <span class="o">*</span> <span class="n">x</span> <span class="o">+</span> <span class="n">intercept</span>
<span class="n">mse</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">average</span><span class="p">((</span><span class="n">y</span> <span class="o">-</span> <span class="n">fitted</span><span class="p">)</span> <span class="o">**</span> <span class="mi">2</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="s2">"Root mean squared error:"</span><span class="p">,</span> <span class="n">mse</span> <span class="o">**</span> <span class="mf">0.5</span><span class="p">)</span>
<span class="n">lw_rmse</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">characters_average</span><span class="p">)</span>
</pre></div></div></div>
<div class="output_subarea output_stream output_stdout output_text">
<pre>Root mean squared error: 7019.17593405
</pre></div>
<div class="output_png output_subarea ">
<img src="../notebooks-images/Error_15_1.png"/></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>The root mean squared error is indeed much smaller if we choose a slope and intercept close to those of the regression line.</p></div></div>
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">lw_rmse</span><span class="p">(</span><span class="mi">90</span><span class="p">,</span> <span class="mi">4000</span><span class="p">)</span>
</pre></div></div></div>
<div class="output_subarea output_stream output_stdout output_text">
<pre>Root mean squared error: 2715.53910638
</pre></div>
<div class="output_png output_subarea ">
<img src="../notebooks-images/Error_17_1.png"/></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>But the minimum is achieved using the regression line itself.</p></div></div>
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">lw_rmse</span><span class="p">(</span><span class="n">lw_slope</span><span class="p">,</span> <span class="n">lw_intercept</span><span class="p">)</span>
</pre></div></div></div>
<div class="output_subarea output_stream output_stdout output_text">
<pre>Root mean squared error: 2701.69078531
</pre></div>
<div class="output_png output_subarea ">
<img src="../notebooks-images/Error_19_1.png"/></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h2 id="Numerical-Optimization">Numerical Optimization<a class="anchor-link" href="#Numerical-Optimization">¶</a></h2></div></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>We can also define <code>mean_squared_error</code> for an arbitrary data set. We'll use a higher-order function so that we can try many different lines on the same data set, simply by passing in their slope and intercept.</p></div></div>
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="k">def</span> <span class="nf">mean_squared_error</span><span class="p">(</span><span class="n">table</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">):</span>
<span class="k">def</span> <span class="nf">for_line</span><span class="p">(</span><span class="n">slope</span><span class="p">,</span> <span class="n">intercept</span><span class="p">):</span>
<span class="n">fitted</span> <span class="o">=</span> <span class="p">(</span><span class="n">slope</span> <span class="o">*</span> <span class="n">table</span><span class="o">.</span><span class="n">column</span><span class="p">(</span><span class="n">x</span><span class="p">)</span> <span class="o">+</span> <span class="n">intercept</span><span class="p">)</span>
<span class="k">return</span> <span class="n">np</span><span class="o">.</span><span class="n">average</span><span class="p">((</span><span class="n">table</span><span class="o">.</span><span class="n">column</span><span class="p">(</span><span class="n">y</span><span class="p">)</span> <span class="o">-</span> <span class="n">fitted</span><span class="p">)</span> <span class="o">**</span> <span class="mi">2</span><span class="p">)</span>
<span class="k">return</span> <span class="n">for_line</span>
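<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>To show the higher-order-function pattern in isolation, here is a minimal stand-in that works on plain NumPy arrays instead of a <code>Table</code> (the name <code>mean_squared_error_arrays</code> and the sample data are hypothetical).</p>
</div></div>

```python
import numpy as np

def mean_squared_error_arrays(x, y):
    """Return a function of (slope, intercept) giving the MSE on arrays x, y."""
    def for_line(slope, intercept):
        fitted = slope * x + intercept
        return np.average((y - fitted) ** 2)
    return for_line

# Data that lies exactly on the line y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2 * x + 1

mse = mean_squared_error_arrays(x, y)
print(mse(2, 1))   # → 0.0   the true line has zero error
print(mse(0, 1))   # → 14.0  a worse line has positive error
```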
</pre></div></div></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>The Old Faithful geyser in Yellowstone National Park erupts regularly, but both the <em>waiting</em> time between eruptions (in minutes) and the <em>duration</em> of the eruptions (in minutes) vary.</p></div></div>
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">faithful</span> <span class="o">=</span> <span class="n">Table</span><span class="o">.</span><span class="n">read_table</span><span class="p">(</span><span class="s1">'faithful.csv'</span><span class="p">)</span>
<span class="n">faithful</span><span class="o">.</span><span class="n">scatter</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span>
</pre></div></div></div>
<div class="output_png output_subarea ">
<img src="../notebooks-images/Error_24_0.png"/></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>It appears that there are two types of eruptions, short and long. The eruptions longer than 3 minutes appear correlated with waiting time.</p></div></div>
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">long</span> <span class="o">=</span> <span class="n">faithful</span><span class="o">.</span><span class="n">where</span><span class="p">(</span><span class="n">faithful</span><span class="o">.</span><span class="n">column</span><span class="p">(</span><span class="s1">'eruptions'</span><span class="p">)</span> <span class="o">></span> <span class="mi">3</span><span class="p">)</span>
<span class="n">long</span><span class="o">.</span><span class="n">scatter</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">fit_line</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
</pre></div></div></div>
<div class="output_png output_subarea ">
<img src="../notebooks-images/Error_26_0.png"/></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>The <code>mse_long</code> function below takes a slope and an intercept and returns the mean squared error of a linear predictor of eruption duration from waiting time; taking the square root gives the root mean squared error.</p></div></div>
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">mse_long</span> <span class="o">=</span> <span class="n">mean_squared_error</span><span class="p">(</span><span class="n">long</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span>
<span class="n">mse_long</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">4</span><span class="p">)</span> <span class="o">**</span> <span class="mf">0.5</span>
</pre></div></div></div>
<div class="output_text output_subarea output_execute_result">
<pre>0.50268461001194098</pre></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>If we experiment with different values, we can find a low-error slope and intercept through trial and error.</p></div></div>
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">mse_long</span><span class="p">(</span><span class="mf">0.01</span><span class="p">,</span> <span class="mf">3.5</span><span class="p">)</span> <span class="o">**</span> <span class="mf">0.5</span>
</pre></div></div></div>
<div class="output_text output_subarea output_execute_result">
<pre>0.39143872353883735</pre></div>
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">mse_long</span><span class="p">(</span><span class="mf">0.02</span><span class="p">,</span> <span class="mf">2.7</span><span class="p">)</span> <span class="o">**</span> <span class="mf">0.5</span>
</pre></div></div></div>
<div class="output_text output_subarea output_execute_result">
<pre>0.38168519564089542</pre></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>The <code>minimize</code> function finds arguments for which a function returns its minimum value. It uses a similar trial-and-error approach, repeatedly adjusting its inputs in the directions that lead to incrementally lower output values.</p></div></div>
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">a</span><span class="p">,</span> <span class="n">b</span> <span class="o">=</span> <span class="n">minimize</span><span class="p">(</span><span class="n">mse_long</span><span class="p">)</span>
</pre></div></div></div>
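<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>The trial-and-error idea behind <code>minimize</code> can be sketched in a few lines of NumPy: starting from an arbitrary guess, repeatedly step in the direction that decreases the mean squared error (gradient descent). This is an illustration on hypothetical noise-free data, not the actual implementation of <code>minimize</code>.</p>
</div></div>

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 3 * x + 7          # data on a known line, for illustration

def mse_grad(slope, intercept):
    """Gradient of the mean squared error with respect to (slope, intercept)."""
    errors = y - (slope * x + intercept)
    return np.array([-2 * np.average(errors * x), -2 * np.average(errors)])

# Repeatedly step in the direction that decreases the MSE
params = np.array([0.0, 0.0])
for _ in range(20000):
    params = params - 0.01 * mse_grad(*params)

a, b = params
print(np.round(a, 3), np.round(b, 3))  # converges to slope 3, intercept 7
```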
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>The root mean squared error of the minimal slope and intercept is smaller than any of the values we considered before.</p></div></div>
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">mse_long</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">)</span> <span class="o">**</span> <span class="mf">0.5</span>
</pre></div></div></div>
<div class="output_text output_subarea output_execute_result">
<pre>0.38014612988504093</pre></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>In fact, these values <code>a</code> and <code>b</code> are the same as the values returned by the <code>slope</code> and <code>intercept</code> functions we developed based on the correlation coefficient. We see small deviations due to the inexact nature of <code>minimize</code>, but the values are essentially the same.</p></div></div>
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="nb">print</span><span class="p">(</span><span class="s2">"slope: "</span><span class="p">,</span> <span class="n">slope</span><span class="p">(</span><span class="n">long</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">))</span>
<span class="nb">print</span><span class="p">(</span><span class="s2">"a: "</span><span class="p">,</span> <span class="n">a</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="s2">"intercept:"</span><span class="p">,</span> <span class="n">intercept</span><span class="p">(</span><span class="n">long</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">))</span>
<span class="nb">print</span><span class="p">(</span><span class="s2">"b: "</span><span class="p">,</span> <span class="n">b</span><span class="p">)</span>
</pre></div></div></div>
<div class="output_subarea output_stream output_stdout output_text">
<pre>slope: 0.0255508940715
a: 0.0255496
intercept: 2.24752334164
b: 2.247629
</pre></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Therefore, we have found not only that the regression line minimizes mean squared error, but also that minimizing mean squared error gives us the regression line. The regression line is the only line that minimizes mean squared error.</p></div></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h2 id="Residuals">Residuals<a class="anchor-link" href="#Residuals">¶</a></h2><p>The amount of error in each of these regression estimates is the difference between the actual value of $y$ and its estimated value on the line. These errors are called <em>residuals</em>. Some residuals are positive. These correspond to points that are above the regression line – points for which the regression line under-estimates $y$. Negative residuals correspond to the line over-estimating values of $y$.</p></div></div>
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">waiting</span> <span class="o">=</span> <span class="n">long</span><span class="o">.</span><span class="n">column</span><span class="p">(</span><span class="s1">'waiting'</span><span class="p">)</span>
<span class="n">eruptions</span> <span class="o">=</span> <span class="n">long</span><span class="o">.</span><span class="n">column</span><span class="p">(</span><span class="s1">'eruptions'</span><span class="p">)</span>
<span class="n">fitted</span> <span class="o">=</span> <span class="n">a</span> <span class="o">*</span> <span class="n">waiting</span> <span class="o">+</span> <span class="n">b</span>
<span class="n">res</span> <span class="o">=</span> <span class="n">long</span><span class="o">.</span><span class="n">with_columns</span><span class="p">([</span>
<span class="s1">'eruptions (fitted)'</span><span class="p">,</span> <span class="n">fitted</span><span class="p">,</span>
<span class="s1">'residuals'</span><span class="p">,</span> <span class="n">eruptions</span> <span class="o">-</span> <span class="n">fitted</span>
<span class="p">])</span>
<span class="n">res</span>
</pre></div></div></div>
<div class="output_html rendered_html output_subarea output_execute_result">
<table border="1" class="dataframe">
<thead>
<tr>
<th>eruptions</th> <th>waiting</th> <th>eruptions (fitted)</th> <th>residuals</th>
</tr>
</thead>
<tbody>
<tr>
<td>3.6 </td> <td>79 </td> <td>4.26605 </td> <td>-0.666047 </td>
</tr>
</tbody>
<tbody><tr>
<td>3.333 </td> <td>74 </td> <td>4.1383 </td> <td>-0.805299 </td>
</tr>
</tbody>
<tbody><tr>
<td>4.533 </td> <td>85 </td> <td>4.41934 </td> <td>0.113655 </td>
</tr>
</tbody>
<tbody><tr>
<td>4.7 </td> <td>88 </td> <td>4.49599 </td> <td>0.204006 </td>
</tr>
</tbody>
<tbody><tr>
<td>3.6 </td> <td>85 </td> <td>4.41934 </td> <td>-0.819345 </td>
</tr>
</tbody>
<tbody><tr>
<td>4.35 </td> <td>85 </td> <td>4.41934 </td> <td>-0.069345 </td>
</tr>
</tbody>
<tbody><tr>
<td>3.917 </td> <td>84 </td> <td>4.3938 </td> <td>-0.476795 </td>
</tr>
</tbody>
<tbody><tr>
<td>4.2 </td> <td>78 </td> <td>4.2405 </td> <td>-0.0404978</td>
</tr>
</tbody>
<tbody><tr>
<td>4.7 </td> <td>83 </td> <td>4.36825 </td> <td>0.331754 </td>
</tr>
</tbody>
<tbody><tr>
<td>4.8 </td> <td>84 </td> <td>4.3938 </td> <td>0.406205 </td>
</tr>
</tbody>
</table>
<p>... (165 rows omitted)</p></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>As with deviations from average, the positive and negative residuals exactly cancel each other out. So the average (and sum) of the residuals is 0. Because we found <code>a</code> and <code>b</code> by numerically minimizing the mean squared error rather than computing them exactly, the sum of the residuals is slightly different from zero in this case.</p></div></div>
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="nb">sum</span><span class="p">(</span><span class="n">res</span><span class="o">.</span><span class="n">column</span><span class="p">(</span><span class="s1">'residuals'</span><span class="p">))</span>
</pre></div></div></div>
<div class="output_text output_subarea output_execute_result">
<pre>-0.00037579999996140145</pre></div>
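<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>When the slope and intercept are computed exactly from the correlation coefficient, the residuals sum to zero up to floating-point error. A short check on hypothetical data, using the standard-units formulas for the slope and intercept:</p>
</div></div>

```python
import numpy as np

# Hypothetical data set; any values work
x = np.array([1.0, 2.0, 4.0, 5.0, 9.0])
y = np.array([2.0, 3.0, 9.0, 8.0, 17.0])

# Exact least-squares slope and intercept from the correlation coefficient
r = np.corrcoef(x, y)[0, 1]
slope = r * np.std(y) / np.std(x)
intercept = np.mean(y) - slope * np.mean(x)

residuals = y - (slope * x + intercept)
print(np.sum(residuals))  # essentially zero
```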
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>A residual plot is a scatter plot of the residuals versus the fitted values. The residual plot of a good regression looks like the one below: a formless cloud with no pattern, centered around the horizontal axis. It shows that there is no discernible non-linear pattern in the original scatter plot.</p></div></div>
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">res</span><span class="o">.</span><span class="n">scatter</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="n">fit_line</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
</pre></div></div></div>
<div class="output_png output_subarea ">
<img src="../notebooks-images/Error_44_0.png"/></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>By contrast, suppose we had attempted to fit a regression line to the entire set of eruptions of Old Faithful in the original dataset.</p></div></div>
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">faithful</span><span class="o">.</span><span class="n">scatter</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="n">fit_line</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
</pre></div></div></div>
<div class="output_png output_subarea ">
<img src="../notebooks-images/Error_46_0.png"/></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>This line does not pass through the centers of the vertical strips. We may be able to see this from the scatter alone, but it becomes dramatically clear in a plot of the residuals.</p></div></div>
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="k">def</span> <span class="nf">residual_plot</span><span class="p">(</span><span class="n">table</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">):</span>
    <span class="n">fitted</span> <span class="o">=</span> <span class="n">fit</span><span class="p">(</span><span class="n">table</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span>
    <span class="n">residuals</span> <span class="o">=</span> <span class="n">table</span><span class="o">.</span><span class="n">column</span><span class="p">(</span><span class="n">y</span><span class="p">)</span> <span class="o">-</span> <span class="n">fitted</span>
    <span class="n">res_table</span> <span class="o">=</span> <span class="n">Table</span><span class="p">()</span><span class="o">.</span><span class="n">with_columns</span><span class="p">([</span>
        <span class="s1">'fitted'</span><span class="p">,</span> <span class="n">fitted</span><span class="p">,</span>
        <span class="s1">'residuals'</span><span class="p">,</span> <span class="n">residuals</span><span class="p">])</span>
    <span class="n">res_table</span><span class="o">.</span><span class="n">scatter</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="n">fit_line</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>

<span class="n">residual_plot</span><span class="p">(</span><span class="n">faithful</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span>
</pre></div></div></div>
<div class="output_png output_subarea ">
<img src="../notebooks-images/Error_48_0.png"/></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>The diagonal stripes show a non-linear pattern in the data. In this case, we would not expect the regression line to provide low-error estimates.</p></div></div>
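The <code>residual_plot</code> function above depends on <code>fit</code> and <code>Table</code> from the <code>datascience</code> library. As a rough self-contained sketch of the same computation, the fitted values and residuals can be produced with NumPy alone (the helper name <code>fitted_and_residuals</code> and the synthetic parabola data are ours, not from the text):

```python
import numpy as np

def fitted_and_residuals(x, y):
    # Least-squares slope and intercept for a degree-1 polynomial,
    # in the order np.polyfit returns them.
    slope, intercept = np.polyfit(x, y, 1)
    fitted = slope * x + intercept
    return fitted, y - fitted

# Synthetic non-linear data: a line fit to a parabola leaves a
# clearly patterned residual plot.
x = np.linspace(0, 10, 50)
y = (x - 5) ** 2
fitted, residuals = fitted_and_residuals(x, y)

# Residuals of a least-squares line always average to zero,
# pattern or no pattern -- the pattern is what the plot reveals.
print(np.isclose(residuals.mean(), 0))  # True
```

A patterned residual plot, not a nonzero residual mean, is the diagnostic: the mean of least-squares residuals is zero by construction.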
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h2 id="Example:-SAT-Scores">Example: SAT Scores<a class="anchor-link" href="#Example:-SAT-Scores">¶</a></h2></div></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Residual plots can be useful for spotting non-linearity in the data, or other features that weaken the regression analysis. For example, consider the SAT data of the previous section, and suppose you try to estimate the <code>Combined</code> score based on <code>Participation Rate</code>.</p></div></div>
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">sat</span> <span class="o">=</span> <span class="n">Table</span><span class="o">.</span><span class="n">read_table</span><span class="p">(</span><span class="s1">'sat2014.csv'</span><span class="p">)</span>
<span class="n">sat</span>
</pre></div></div></div>
<div class="output_html rendered_html output_subarea output_execute_result">
<table border="1" class="dataframe">
<thead>
<tr>
<th>State</th> <th>Participation Rate</th> <th>Critical Reading</th> <th>Math</th> <th>Writing</th> <th>Combined</th>
</tr>
</thead>
<tbody>
<tr>
<td>North Dakota</td> <td>2.3 </td> <td>612 </td> <td>620 </td> <td>584 </td> <td>1816 </td>
</tr>
<tr>
<td>Illinois </td> <td>4.6 </td> <td>599 </td> <td>616 </td> <td>587 </td> <td>1802 </td>
</tr>
<tr>
<td>Iowa </td> <td>3.1 </td> <td>605 </td> <td>611 </td> <td>578 </td> <td>1794 </td>
</tr>
<tr>
<td>South Dakota</td> <td>2.9 </td> <td>604 </td> <td>609 </td> <td>579 </td> <td>1792 </td>
</tr>
<tr>
<td>Minnesota </td> <td>5.9 </td> <td>598 </td> <td>610 </td> <td>578 </td> <td>1786 </td>
</tr>
<tr>
<td>Michigan </td> <td>3.8 </td> <td>593 </td> <td>610 </td> <td>581 </td> <td>1784 </td>
</tr>
<tr>
<td>Wisconsin </td> <td>3.9 </td> <td>596 </td> <td>608 </td> <td>578 </td> <td>1782 </td>
</tr>
<tr>
<td>Missouri </td> <td>4.2 </td> <td>595 </td> <td>597 </td> <td>579 </td> <td>1771 </td>
</tr>
<tr>
<td>Wyoming </td> <td>3.3 </td> <td>590 </td> <td>599 </td> <td>573 </td> <td>1762 </td>
</tr>
<tr>
<td>Kansas </td> <td>5.3 </td> <td>591 </td> <td>596 </td> <td>566 </td> <td>1753 </td>
</tr>
</tbody>
</table>
<p>... (41 rows omitted)</p></div>
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">sat</span><span class="o">.</span><span class="n">scatter</span><span class="p">(</span><span class="s1">'Participation Rate'</span><span class="p">,</span> <span class="s1">'Combined'</span><span class="p">,</span> <span class="n">s</span><span class="o">=</span><span class="mi">8</span><span class="p">)</span>
</pre></div></div></div>
<div class="output_png output_subarea ">
<img src="../notebooks-images/Error_53_0.png"/></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>The relation between the variables is clearly non-linear, but you might be tempted to fit a straight line anyway, especially if you had never looked at a scatter diagram of the data.</p></div></div>
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">sat</span><span class="o">.</span><span class="n">scatter</span><span class="p">(</span><span class="s1">'Participation Rate'</span><span class="p">,</span> <span class="s1">'Combined'</span><span class="p">,</span> <span class="n">s</span><span class="o">=</span><span class="mi">8</span><span class="p">,</span> <span class="n">fit_line</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
</pre></div></div></div>
<div class="output_png output_subarea ">
<img src="../notebooks-images/Error_55_0.png"/></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>The points in the scatter plot start out above the regression line, then are consistently below the line, then above, then below. This pattern of non-linearity is more clearly visible in the residual plot.</p></div></div>
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">residual_plot</span><span class="p">(</span><span class="n">sat</span><span class="p">,</span> <span class="s1">'Participation Rate'</span><span class="p">,</span> <span class="s1">'Combined'</span><span class="p">)</span>
<span class="n">_</span> <span class="o">=</span> <span class="n">plt</span><span class="o">.</span><span class="n">title</span><span class="p">(</span><span class="s1">'Residual plot of a bad regression'</span><span class="p">)</span>
</pre></div></div></div>
<div class="output_png output_subarea ">
<img src="../notebooks-images/Error_57_0.png"/></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>This residual plot is not a formless cloud; it shows a non-linear pattern, and is a signal that linear regression should not have been used for these data.</p></div></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h3 id="The-Size-of-Residuals">The Size of Residuals<a class="anchor-link" href="#The-Size-of-Residuals">¶</a></h3><p>We will conclude by observing several relationships between quantities we have already identified. We'll illustrate the relationships using the <code>long</code> eruptions of Old Faithful, but they indeed hold for any collection of paired observations.</p>
<p><strong>Fact 1:</strong> The root mean squared error of a line that has zero slope and an intercept at the average of <code>y</code> is the standard deviation of <code>y</code>.</p></div></div>
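Since <code>mse_long</code> and <code>eruptions</code> are defined earlier in the chapter, here is a self-contained check of Fact 1 on synthetic data (the array <code>y</code> below is a made-up stand-in for the eruption durations): a line with slope 0 and intercept $\bar{y}$ predicts the mean everywhere, so its root mean squared error is exactly the SD of $y$.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(3.5, 0.4, size=100)  # made-up stand-in for eruption durations

# The flat line at the mean predicts mean(y) at every point, so its
# mean squared error is the mean squared deviation from the mean --
# which is the definition of the variance.
rmse_flat = np.mean((y - np.mean(y)) ** 2) ** 0.5

print(np.isclose(rmse_flat, np.std(y)))  # True
```

Note that <code>np.std</code> uses the population convention (<code>ddof=0</code>), which is what makes the identity exact.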
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">mse_long</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">np</span><span class="o">.</span><span class="n">average</span><span class="p">(</span><span class="n">eruptions</span><span class="p">))</span> <span class="o">**</span> <span class="mf">0.5</span>
</pre></div></div></div>
<div class="output_text output_subarea output_execute_result">
<pre>0.40967604587437784</pre></div>
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">np</span><span class="o">.</span><span class="n">std</span><span class="p">(</span><span class="n">eruptions</span><span class="p">)</span>
</pre></div></div></div>
<div class="output_text output_subarea output_execute_result">
<pre>0.40967604587437784</pre></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>By contrast, the standard deviation of fitted values is smaller.</p></div></div>
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">eruptions_fitted</span> <span class="o">=</span> <span class="n">fit</span><span class="p">(</span><span class="n">long</span><span class="p">,</span> <span class="s1">'waiting'</span><span class="p">,</span> <span class="s1">'eruptions'</span><span class="p">)</span>
<span class="n">np</span><span class="o">.</span><span class="n">std</span><span class="p">(</span><span class="n">eruptions_fitted</span><span class="p">)</span>
</pre></div></div></div>
<div class="output_text output_subarea output_execute_result">
<pre>0.15271994814407569</pre></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p><strong>Fact 2</strong>: The ratio of the standard deviation of the fitted values to the standard deviation of the <code>y</code> values is $|r|$.</p></div></div>
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">r</span> <span class="o">=</span> <span class="n">correlation</span><span class="p">(</span><span class="n">long</span><span class="p">,</span> <span class="s1">'waiting'</span><span class="p">,</span> <span class="s1">'eruptions'</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">'r: '</span><span class="p">,</span> <span class="n">r</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">'ratio: '</span><span class="p">,</span> <span class="n">np</span><span class="o">.</span><span class="n">std</span><span class="p">(</span><span class="n">eruptions_fitted</span><span class="p">)</span> <span class="o">/</span> <span class="n">np</span><span class="o">.</span><span class="n">std</span><span class="p">(</span><span class="n">eruptions</span><span class="p">))</span>
</pre></div></div></div>
<div class="output_subarea output_stream output_stdout output_text">
<pre>r: 0.372782225571
ratio: 0.372782225571
</pre></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Notice the absolute value of $r$ in the statement above. For the eruption durations and waiting times, the correlation is positive, so there is no difference between using $r$ and using its absolute value. However, the result is true for variables that are negatively correlated as well, provided we are careful to use the absolute value of $r$ instead of $r$.</p></div></div>
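To see why the absolute value matters, here is a quick check of Fact 2 with a negative correlation, on made-up data (the arrays and seed are ours, for illustration only): the ratio of SDs matches $|r|$, not $r$.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = -2 * x + rng.normal(size=200)  # negatively associated with x

r = np.corrcoef(x, y)[0, 1]        # r is negative here
slope, intercept = np.polyfit(x, y, 1)
fitted = slope * x + intercept

# std(fitted) = |slope| * std(x) = |r| * std(y), so the ratio is |r|.
ratio = np.std(fitted) / np.std(y)
print(r < 0, np.isclose(ratio, abs(r)))  # True True
```

The identity is exact for any least-squares line, because the slope is $r \cdot \mathrm{SD}(y)/\mathrm{SD}(x)$.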
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>The regression line does a better job of estimating eruption lengths than a flat line at the average, so the rough size of the errors made using the regression line must be smaller than that of the errors made using a flat line. In other words, the SD of the residuals must be smaller than the overall SD of $y$.</p></div></div>
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">eruptions_residuals</span> <span class="o">=</span> <span class="n">eruptions</span> <span class="o">-</span> <span class="n">eruptions_fitted</span>
<span class="n">np</span><span class="o">.</span><span class="n">std</span><span class="p">(</span><span class="n">eruptions_residuals</span><span class="p">)</span>
</pre></div></div></div>
<div class="output_text output_subarea output_execute_result">
<pre>0.38014612980028634</pre></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p><strong>Fact 3</strong>: The SD of the residuals is $\sqrt{1-r^2}$ times the SD of $y$.</p></div></div>
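Fact 3 is again easy to verify on synthetic data (made up for illustration); it holds exactly for any least-squares line, whatever the scatter looks like.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=200)
y = 0.5 * x + rng.normal(size=200)  # made-up paired data

r = np.corrcoef(x, y)[0, 1]
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)

# SD of residuals = sqrt(1 - r^2) * SD of y, exactly.
print(np.isclose(np.std(residuals), np.sqrt(1 - r**2) * np.std(y)))  # True
```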
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">np</span><span class="o">.</span><span class="n">std</span><span class="p">(</span><span class="n">eruptions</span><span class="p">)</span> <span class="o">*</span> <span class="p">(</span><span class="mi">1</span> <span class="o">-</span> <span class="n">r</span><span class="o">**</span><span class="mi">2</span><span class="p">)</span> <span class="o">**</span> <span class="mf">0.5</span>
</pre></div></div></div>
<div class="output_text output_subarea output_execute_result">
<pre>0.38014612980028639</pre></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p><strong>Fact 4:</strong> The variance of the fitted values plus the variance of the residuals gives the variance of <code>y</code>.</p></div></div>
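Fact 4 follows by combining Facts 2 and 3. By Fact 2, the variance of the fitted values is $r^2$ times the variance of $y$; by Fact 3, the variance of the residuals is $(1 - r^2)$ times the variance of $y$. Adding the two:

$$\mbox{Var}(\mbox{fitted}) + \mbox{Var}(\mbox{residuals}) ~=~ r^2 \cdot \mbox{Var}(y) + (1 - r^2) \cdot \mbox{Var}(y) ~=~ \mbox{Var}(y)$$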
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">np</span><span class="o">.</span><span class="n">std</span><span class="p">(</span><span class="n">eruptions_fitted</span><span class="p">)</span> <span class="o">**</span> <span class="mi">2</span> <span class="o">+</span> <span class="n">np</span><span class="o">.</span><span class="n">std</span><span class="p">(</span><span class="n">eruptions_residuals</span><span class="p">)</span> <span class="o">**</span> <span class="mi">2</span>
</pre></div></div></div>
<div class="output_text output_subarea output_execute_result">
<pre>0.16783446256326531</pre></div>
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">np</span><span class="o">.</span><span class="n">std</span><span class="p">(</span><span class="n">eruptions</span><span class="p">)</span> <span class="o">**</span> <span class="mi">2</span>
</pre></div></div></div>
<div class="output_text output_subarea output_execute_result">
<pre>0.16783446256326534</pre></div></div>