<div id="ipython-notebook">
<a class="interact-button" href="http://data8.berkeley.edu/hub/interact?repo=textbook&path=notebooks/airline_ontime.csv&path=notebooks/nba2013.csv&path=notebooks/Spread.ipynb">Interact</a>
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
tex2jax: {
inlineMath: [['$','$']],
processEscapes: true
}
});
</script>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="Variability">Variability<a class="anchor-link" href="#Variability">¶</a></h1></div></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h3 id="The-Rough-Size-of-Deviations-from-Average">The Rough Size of Deviations from Average<a class="anchor-link" href="#The-Rough-Size-of-Deviations-from-Average">¶</a></h3><p>As we have seen, inference based on test statistics must take into account the way in which the statistics vary across samples. It is therefore important to be able to quantify the variability in any list of numbers. One way is to create a measure of the difference between the values in the list and the average of the list.</p>
<p>We will start by defining such a measure in the context of a simple array of just four <code>numbers</code>.</p></div></div>
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">numbers</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">([</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">10</span><span class="p">])</span>
<span class="n">numbers</span>
</pre></div></div></div>
<div class="output_text output_subarea output_execute_result">
<pre>array([ 1, 2, 2, 10])</pre></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>The goal is to get a measure of roughly how far off the numbers in the list are from the average. To do this, we need the average, and all the deviations from the average. A "deviation from average" is just a value minus the average.</p></div></div>
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="c1"># Step 1. The average.</span>
<span class="n">average</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="n">numbers</span><span class="p">)</span>
<span class="n">average</span>
</pre></div></div></div>
<div class="output_text output_subarea output_execute_result">
<pre>3.75</pre></div>
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="c1"># Step 2. The deviations from average.</span>
<span class="n">deviations</span> <span class="o">=</span> <span class="n">numbers</span> <span class="o">-</span> <span class="n">average</span>
<span class="n">numbers_and_deviations</span> <span class="o">=</span> <span class="n">Table</span><span class="p">()</span><span class="o">.</span><span class="n">with_columns</span><span class="p">([</span>
<span class="s1">'value'</span><span class="p">,</span> <span class="n">numbers</span><span class="p">,</span>
<span class="s1">'deviation from average'</span><span class="p">,</span> <span class="n">deviations</span>
<span class="p">])</span>
<span class="n">numbers_and_deviations</span>
</pre></div></div></div>
<div class="output_html rendered_html output_subarea output_execute_result">
<table border="1" class="dataframe">
<thead>
<tr>
<th>value</th> <th>deviation from average</th>
</tr>
</thead>
<tbody>
<tr>
<td>1 </td> <td>-2.75 </td>
</tr>
</tbody>
<tbody><tr>
<td>2 </td> <td>-1.75 </td>
</tr>
</tbody>
<tbody><tr>
<td>2 </td> <td>-1.75 </td>
</tr>
</tbody>
<tbody><tr>
<td>10 </td> <td>6.25 </td>
</tr>
</tbody>
</table></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Some of the deviations are negative; those correspond to values that are below average. Positive deviations correspond to above-average values.</p>
<p>To calculate roughly how big the deviations are, it is natural to compute the mean of the deviations. But something interesting happens when all the deviations are added together:</p></div></div>
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="nb">sum</span><span class="p">(</span><span class="n">deviations</span><span class="p">)</span>
</pre></div></div></div>
<div class="output_text output_subarea output_execute_result">
<pre>0.0</pre></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>The positive deviations exactly cancel out the negative ones. This is true of all lists of numbers, no matter what the histogram of the list looks like: the sum of the deviations from average is zero.</p>
<p>Since the sum of the deviations is 0, the mean of the deviations will be 0 as well:</p></div></div>
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">np</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="n">deviations</span><span class="p">)</span>
</pre></div></div></div>
<div class="output_text output_subarea output_execute_result">
<pre>0.0</pre></div>
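<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>The cancellation can be checked on other data as well. The sketch below uses a made-up, deliberately skewed array (the names <code>rng</code>, <code>values</code> and <code>sums_to_zero</code> are illustrative and not part of the text). On arbitrary floating-point data the sum may come out as a tiny nonzero number because of rounding, so the check uses a tolerance rather than exact equality.</p></div></div>

```python
import numpy as np

# Hypothetical example data: any list of numbers would do.
rng = np.random.default_rng(0)
values = rng.exponential(scale=10, size=1000)

# Deviations from average, exactly as in Step 2 above.
deviations = values - np.mean(values)

# Rounding can leave a tiny remainder, so compare with a tolerance.
total = np.sum(deviations)
sums_to_zero = np.isclose(total, 0, atol=1e-8)
print(total, sums_to_zero)
```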
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Because of this, the mean of the deviations is not a useful measure of the size of the deviations. What we really want to know is roughly how big the deviations are, regardless of whether they are positive or negative. So we need a way to eliminate the signs of the deviations.</p>
<p>There are two time-honored ways of losing signs: the absolute value, and the square. It turns out that taking the square constructs a measure with extremely powerful properties, some of which we will study in this course.</p>
<p>So let us eliminate the signs by squaring all the deviations. Then we will take the mean of the squares:</p></div></div>
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="c1"># Step 3. The squared deviations from average</span>
<span class="n">squared_deviations</span> <span class="o">=</span> <span class="n">deviations</span> <span class="o">**</span> <span class="mi">2</span>
<span class="n">numbers_and_deviations</span><span class="o">.</span><span class="n">with_column</span><span class="p">(</span><span class="s1">'squared deviations from average'</span><span class="p">,</span> <span class="n">squared_deviations</span><span class="p">)</span>
</pre></div></div></div>
<div class="output_html rendered_html output_subarea output_execute_result">
<table border="1" class="dataframe">
<thead>
<tr>
<th>value</th> <th>deviation from average</th> <th>squared deviations from average</th>
</tr>
</thead>
<tbody>
<tr>
<td>1 </td> <td>-2.75 </td> <td>7.5625 </td>
</tr>
</tbody>
<tbody><tr>
<td>2 </td> <td>-1.75 </td> <td>3.0625 </td>
</tr>
</tbody>
<tbody><tr>
<td>2 </td> <td>-1.75 </td> <td>3.0625 </td>
</tr>
</tbody>
<tbody><tr>
<td>10 </td> <td>6.25 </td> <td>39.0625 </td>
</tr>
</tbody>
</table></div>
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="c1"># Variance = the mean squared deviation from average</span>
<span class="n">variance</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="n">squared_deviations</span><span class="p">)</span>
<span class="n">variance</span>
</pre></div></div></div>
<div class="output_text output_subarea output_execute_result">
<pre>13.1875</pre></div>
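<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>For comparison only, here is a sketch of the other way of losing signs mentioned above, the absolute value. The name <code>mean_absolute_deviation</code> is introduced here for illustration. The result is in the original units, but it is not the measure that this section goes on to develop.</p></div></div>

```python
import numpy as np

numbers = np.array([1, 2, 2, 10])
deviations = numbers - np.mean(numbers)

# Remove the signs with absolute values instead of squares,
# then average the resulting distances.
mean_absolute_deviation = np.mean(np.abs(deviations))
print(mean_absolute_deviation)
```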
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p><strong>Variance:</strong> The mean squared deviation is called the <em>variance</em> of a sequence of values. While it does give us an idea of spread, it is not on the same scale as the original variable and its units are the square of the original. This makes interpretation very difficult. So we return to the original scale by taking the positive square root of the variance:</p></div></div>
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="c1"># Standard Deviation:   root mean squared deviation from average</span>
<span class="c1"># Steps of calculation: 5    4    3       2         1</span>
<span class="n">sd</span> <span class="o">=</span> <span class="n">variance</span> <span class="o">**</span> <span class="mf">0.5</span>
<span class="n">sd</span>
</pre></div></div></div>
<div class="output_text output_subarea output_execute_result">
<pre>3.6314597615834874</pre></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h2 id="Standard-Deviation">Standard Deviation<a class="anchor-link" href="#Standard-Deviation">¶</a></h2><p>The quantity that we have just computed is called the <em>standard deviation</em> of the list, and is abbreviated as SD. It measures roughly how far the numbers on the list are from their average.</p>
<p><strong>Definition.</strong> The SD of a list is defined as the <em>root mean square of deviations from average</em>. That's a mouthful. But read it from right to left and you have the sequence of steps in the calculation.</p>
<p><strong>Computation.</strong> The <code>NumPy</code> function <code>np.std</code> computes the SD of values in an array:</p></div></div>
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">np</span><span class="o">.</span><span class="n">std</span><span class="p">(</span><span class="n">numbers</span><span class="p">)</span>
</pre></div></div></div>
<div class="output_text output_subarea output_execute_result">
<pre>3.6314597615834874</pre></div>
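<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>To connect the definition with <code>np.std</code>, the five steps can be collected into one function. This is a minimal sketch; <code>sd_from_scratch</code> is a hypothetical helper name, not something defined elsewhere in the text. By default <code>np.std</code> divides by the number of entries (<code>ddof=0</code>), which is the "mean" in "root mean square", so the two results should agree.</p></div></div>

```python
import numpy as np

def sd_from_scratch(values):
    """Root mean square of deviations from average (illustrative helper)."""
    values = np.asarray(values, dtype=float)
    average = np.mean(values)            # 1. average
    deviations = values - average        # 2. deviations from average
    squared = deviations ** 2            # 3. squared
    mean_square = np.mean(squared)       # 4. mean
    return mean_square ** 0.5            # 5. root

numbers = np.array([1, 2, 2, 10])
matches_numpy = np.isclose(sd_from_scratch(numbers), np.std(numbers))
print(sd_from_scratch(numbers), matches_numpy)
```

<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Passing <code>ddof=1</code> to <code>np.std</code> would divide by one less than the number of entries instead, giving a slightly larger value that does not match the definition used here.</p></div></div>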
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h3 id="Working-with-the-SD">Working with the SD<a class="anchor-link" href="#Working-with-the-SD">¶</a></h3><p>Let us now examine the SD in the context of a more interesting dataset. The table <code>nba</code> contains data on the players in the National Basketball Association (NBA) in 2013. For each player, the table records the position at which the player usually played, his height in inches, weight in pounds, and age in years.</p></div></div>
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">nba</span> <span class="o">=</span> <span class="n">Table</span><span class="o">.</span><span class="n">read_table</span><span class="p">(</span><span class="s2">"nba2013.csv"</span><span class="p">)</span>
<span class="n">nba</span>
</pre></div></div></div>
<div class="output_html rendered_html output_subarea output_execute_result">
<table border="1" class="dataframe">
<thead>
<tr>
<th>Name</th> <th>Position</th> <th>Height</th> <th>Weight</th> <th>Age in 2013</th>
</tr>
</thead>
<tbody>
<tr>
<td>DeQuan Jones </td> <td>Guard </td> <td>80 </td> <td>221 </td> <td>23 </td>
</tr>
</tbody>
<tbody><tr>
<td>Darius Miller </td> <td>Guard </td> <td>80 </td> <td>235 </td> <td>23 </td>
</tr>
</tbody>
<tbody><tr>
<td>Trevor Ariza </td> <td>Guard </td> <td>80 </td> <td>210 </td> <td>28 </td>
</tr>
</tbody>
<tbody><tr>
<td>James Jones </td> <td>Guard </td> <td>80 </td> <td>215 </td> <td>32 </td>
</tr>
</tbody>
<tbody><tr>
<td>Wesley Johnson </td> <td>Guard </td> <td>79 </td> <td>215 </td> <td>26 </td>
</tr>
</tbody>
<tbody><tr>
<td>Klay Thompson </td> <td>Guard </td> <td>79 </td> <td>205 </td> <td>23 </td>
</tr>
</tbody>
<tbody><tr>
<td>Thabo Sefolosha</td> <td>Guard </td> <td>79 </td> <td>215 </td> <td>29 </td>
</tr>
</tbody>
<tbody><tr>
<td>Chase Budinger </td> <td>Guard </td> <td>79 </td> <td>218 </td> <td>25 </td>
</tr>
</tbody>
<tbody><tr>
<td>Kevin Martin </td> <td>Guard </td> <td>79 </td> <td>185 </td> <td>30 </td>
</tr>
</tbody>
<tbody><tr>
<td>Evan Fournier </td> <td>Guard </td> <td>79 </td> <td>206 </td> <td>20 </td>
</tr>
</tbody>
</table>
<p>... (495 rows omitted)</p></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Here is a histogram of the players' heights.</p></div></div>
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">nba</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s1">'Height'</span><span class="p">)</span><span class="o">.</span><span class="n">hist</span><span class="p">(</span><span class="n">bins</span><span class="o">=</span><span class="n">np</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">68</span><span class="p">,</span> <span class="mi">88</span><span class="p">,</span> <span class="mi">1</span><span class="p">))</span>
</pre></div></div></div>
<div class="output_png output_subarea ">
<img src="../notebooks-images/Spread_22_0.png"/></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>It is no surprise that NBA players are tall! Their average height is just over 79 inches (6'7"), about 10 inches taller than the average height of men in the United States.</p></div></div>
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">mean_height</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="n">nba</span><span class="o">.</span><span class="n">column</span><span class="p">(</span><span class="s1">'Height'</span><span class="p">))</span>
<span class="n">mean_height</span>
</pre></div></div></div>
<div class="output_text output_subarea output_execute_result">
<pre>79.065346534653472</pre></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>About how far off are the players' heights from the average? This is measured by the SD of the heights, which is about 3.45 inches.</p></div></div>
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">sd_height</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">std</span><span class="p">(</span><span class="n">nba</span><span class="o">.</span><span class="n">column</span><span class="p">(</span><span class="s1">'Height'</span><span class="p">))</span>
<span class="n">sd_height</span>
</pre></div></div></div>
<div class="output_text output_subarea output_execute_result">
<pre>3.4505971830275546</pre></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>The towering center Hasheem Thabeet of the Oklahoma City Thunder was the tallest player at a height of 87 inches.</p></div></div>
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">nba</span><span class="o">.</span><span class="n">sort</span><span class="p">(</span><span class="s1">'Height'</span><span class="p">,</span> <span class="n">descending</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span><span class="o">.</span><span class="n">show</span><span class="p">(</span><span class="mi">3</span><span class="p">)</span>
</pre></div></div></div>
<div class="output_html rendered_html output_subarea ">
<table border="1" class="dataframe">
<thead>
<tr>
<th>Name</th> <th>Position</th> <th>Height</th> <th>Weight</th> <th>Age in 2013</th>
</tr>
</thead>
<tbody>
<tr>
<td>Hasheem Thabeet</td> <td>Center </td> <td>87 </td> <td>263 </td> <td>26 </td>
</tr>
</tbody>
<tbody><tr>
<td>Roy Hibbert </td> <td>Center </td> <td>86 </td> <td>278 </td> <td>26 </td>
</tr>
</tbody>
<tbody><tr>
<td>Tyson Chandler </td> <td>Center </td> <td>85 </td> <td>235 </td> <td>30 </td>
</tr>
</tbody>
</table>
<p>... (502 rows omitted)</p></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Thabeet was about 8 inches above the average height.</p></div></div>
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="mi">87</span> <span class="o">-</span> <span class="n">mean_height</span>
</pre></div></div></div>
<div class="output_text output_subarea output_execute_result">
<pre>7.9346534653465284</pre></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>That's a deviation from average, and it is about 2.3 times the standard deviation:</p></div></div>
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="p">(</span><span class="mi">87</span> <span class="o">-</span> <span class="n">mean_height</span><span class="p">)</span><span class="o">/</span><span class="n">sd_height</span>
</pre></div></div></div>
<div class="output_text output_subarea output_execute_result">
<pre>2.2995015194397923</pre></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>In other words, the height of the tallest player was about 2.3 SDs above average.</p>
<p>At 69 inches tall, Isaiah Thomas was one of the two shortest players in the NBA. His height was about 2.9 SDs below average.</p></div></div>
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">nba</span><span class="o">.</span><span class="n">sort</span><span class="p">(</span><span class="s1">'Height'</span><span class="p">)</span><span class="o">.</span><span class="n">show</span><span class="p">(</span><span class="mi">3</span><span class="p">)</span>
</pre></div></div></div>
<div class="output_html rendered_html output_subarea ">
<table border="1" class="dataframe">
<thead>
<tr>
<th>Name</th> <th>Position</th> <th>Height</th> <th>Weight</th> <th>Age in 2013</th>
</tr>
</thead>
<tbody>
<tr>
<td>Isaiah Thomas </td> <td>Guard </td> <td>69 </td> <td>185 </td> <td>24 </td>
</tr>
</tbody>
<tbody><tr>
<td>Nate Robinson </td> <td>Guard </td> <td>69 </td> <td>180 </td> <td>29 </td>
</tr>
</tbody>
<tbody><tr>
<td>John Lucas III</td> <td>Guard </td> <td>71 </td> <td>157 </td> <td>30 </td>
</tr>
</tbody>
</table>
<p>... (502 rows omitted)</p></div>
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="p">(</span><span class="mi">69</span> <span class="o">-</span> <span class="n">mean_height</span><span class="p">)</span><span class="o">/</span><span class="n">sd_height</span>
</pre></div></div></div>
<div class="output_text output_subarea output_execute_result">
<pre>-2.9169868288775844</pre></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>What we have observed is that the tallest and shortest players were both just a few SDs away from the average height. This is an example of why the SD is a useful measure of spread. No matter what the shape of the histogram, the average and the SD together tell you a lot about the data.</p></div></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h3 id="First-main-reason-for-measuring-spread-by-the-SD">First main reason for measuring spread by the SD<a class="anchor-link" href="#First-main-reason-for-measuring-spread-by-the-SD">¶</a></h3><p><strong>Informal statement.</strong> For all lists, the bulk of the entries are within the range "average $\pm$ a few SDs".</p>
<p>Try to resist the desire to know exactly what fuzzy words like "bulk" and "few" mean. We will make them precise later in this section. For now, let us examine the statement in the context of some examples.</p></div></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>We have already seen that <em>all</em> of the heights of the NBA players were in the range "average $\pm$ 3 SDs".</p>
<p>What about the ages? Here is a histogram of the distribution. Yes, there was a 15-year-old player in the NBA in 2013.</p></div></div>
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">nba</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s1">'Age in 2013'</span><span class="p">)</span><span class="o">.</span><span class="n">hist</span><span class="p">(</span><span class="n">bins</span><span class="o">=</span><span class="n">np</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">15</span><span class="p">,</span> <span class="mi">45</span><span class="p">,</span> <span class="mi">1</span><span class="p">))</span>
</pre></div></div></div>
<div class="output_png output_subarea ">
<img src="../notebooks-images/Spread_39_0.png"/></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Juwan Howard was the oldest player, at 40. His age was about 3.2 SDs above average.</p></div></div>
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">nba</span><span class="o">.</span><span class="n">sort</span><span class="p">(</span><span class="s1">'Age in 2013'</span><span class="p">,</span> <span class="n">descending</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span><span class="o">.</span><span class="n">show</span><span class="p">(</span><span class="mi">3</span><span class="p">)</span>
</pre></div></div></div>
<div class="output_html rendered_html output_subarea ">
<table border="1" class="dataframe">
<thead>
<tr>
<th>Name</th> <th>Position</th> <th>Height</th> <th>Weight</th> <th>Age in 2013</th>
</tr>
</thead>
<tbody>
<tr>
<td>Juwan Howard</td> <td>Forward </td> <td>81 </td> <td>250 </td> <td>40 </td>
</tr>
</tbody>
<tbody><tr>
<td>Marcus Camby</td> <td>Center </td> <td>83 </td> <td>235 </td> <td>39 </td>
</tr>
</tbody>
<tbody><tr>
<td>Derek Fisher</td> <td>Guard </td> <td>73 </td> <td>210 </td> <td>39 </td>
</tr>
</tbody>
</table>
<p>... (502 rows omitted)</p></div>
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">ages</span> <span class="o">=</span> <span class="n">nba</span><span class="p">[</span><span class="s1">'Age in 2013'</span><span class="p">]</span>
<span class="p">(</span><span class="mi">40</span> <span class="o">-</span> <span class="n">np</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="n">ages</span><span class="p">))</span><span class="o">/</span><span class="n">np</span><span class="o">.</span><span class="n">std</span><span class="p">(</span><span class="n">ages</span><span class="p">)</span>
</pre></div></div></div>
<div class="output_text output_subarea output_execute_result">
<pre>3.1958482778922357</pre></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>The youngest was 15-year-old Jarvis Varnado, who won the NBA Championship that year with the Miami Heat. His age was about 2.6 SDs below average.</p></div></div>
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">nba</span><span class="o">.</span><span class="n">sort</span><span class="p">(</span><span class="s1">'Age in 2013'</span><span class="p">)</span><span class="o">.</span><span class="n">show</span><span class="p">(</span><span class="mi">3</span><span class="p">)</span>
</pre></div></div></div>
<div class="output_html rendered_html output_subarea ">
<table border="1" class="dataframe">
<thead>
<tr>
<th>Name</th> <th>Position</th> <th>Height</th> <th>Weight</th> <th>Age in 2013</th>
</tr>
</thead>
<tbody>
<tr>
<td>Jarvis Varnado </td> <td>Forward </td> <td>81 </td> <td>230 </td> <td>15 </td>
</tr>
</tbody>
<tbody><tr>
<td>Giannis Antetokounmpo</td> <td>Forward </td> <td>81 </td> <td>205 </td> <td>18 </td>
</tr>
</tbody>
<tbody><tr>
<td>Sergey Karasev </td> <td>Guard </td> <td>79 </td> <td>197 </td> <td>19 </td>
</tr>
</tbody>
</table>
<p>... (502 rows omitted)</p></div>
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="p">(</span><span class="mi">15</span> <span class="o">-</span> <span class="n">np</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="n">ages</span><span class="p">))</span><span class="o">/</span><span class="n">np</span><span class="o">.</span><span class="n">std</span><span class="p">(</span><span class="n">ages</span><span class="p">)</span>
</pre></div></div></div>
<div class="output_text output_subarea output_execute_result">
<pre>-2.5895811038670811</pre></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>What we have observed for the heights and ages is true in great generality. For all lists, the bulk of the entries are no more than 2 or 3 SDs away from the average.</p>
<p>These rough statements about deviations from average are made precise by the following result.</p></div></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h3 id="Chebychev's-inequality">Chebychev's inequality<a class="anchor-link" href="#Chebychev's-inequality">¶</a></h3><p>For all lists, and all positive numbers <em>z</em>, the proportion of entries that are in the range
"average $\pm z$ SDs" is <strong>at least</strong> $1 - \frac{1}{z^2}$.</p></div></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>What makes this result powerful is that it is true for all lists – all distributions, no matter how irregular.</p>
<p>Specifically, Chebychev's inequality says that for every list:</p>
<ul>
<li><p>the proportion in the range "average $\pm$ 2 SDs" is at least 1 - 1/4 = 0.75</p>
</li>
<li><p>the proportion in the range "average $\pm$ 3 SDs" is at least 1 - 1/9 $\approx$ 0.89</p>
</li>
<li><p>the proportion in the range "average $\pm$ 4.5 SDs" is at least 1 - 1/$4.5^2$ $\approx$ 0.95</p>
</li>
</ul>
<p>Chebychev's Inequality gives a lower bound, not an exact answer or an approximation. For example, the percent of entries in the range "average $\pm ~2$ SDs" might be quite a bit larger than 75%. But it cannot be smaller.</p></div></div>
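<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>The bounds listed above can be compared with observed proportions. The sketch below assumes a made-up, strongly skewed array in place of real data; the names <code>chebychev_bound</code>, <code>proportion_within</code> and <code>skewed</code> are hypothetical. For each <em>z</em>, the observed proportion should be at least as large as the bound, and may be much larger.</p></div></div>

```python
import numpy as np

def chebychev_bound(z):
    """Lower bound on the proportion of entries within z SDs of average."""
    return 1 - 1 / z**2

def proportion_within(values, z):
    """Observed proportion of entries in the range average +/- z SDs."""
    values = np.asarray(values, dtype=float)
    distance_in_sds = np.abs(values - np.mean(values)) / np.std(values)
    return np.mean(distance_in_sds <= z)

# Hypothetical data, chosen to be far from bell-shaped.
rng = np.random.default_rng(42)
skewed = rng.exponential(scale=5, size=10_000)

for z in (2, 3, 4.5):
    print(z, chebychev_bound(z), proportion_within(skewed, z))
```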
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h3 id="Standard-units">Standard units<a class="anchor-link" href="#Standard-units">¶</a></h3><p>In the calculations above, $z$ measures <em>standard units</em>, the number of standard deviations above average.</p>
<p>Some values of standard units are negative, corresponding to original values that are below average. Other values of standard units are positive. But no matter what the distribution of the list looks like, Chebychev's Inequality says that at least 1 - 1/25 = 96% of the values, measured in standard units, will be in the range (-5, 5).</p>
<p>To convert a value to standard units, first find how far it is from average, and then compare that deviation with the standard deviation.
$$
z ~=~ \frac{\mbox{value }-\mbox{ average}}{\mbox{SD}}
$$</p>
<p>As we will see, standard units are frequently used in data analysis. So it is useful to define a function that converts an array of numbers to standard units.</p></div></div>
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="k">def</span> <span class="nf">standard_units</span><span class="p">(</span><span class="n">any_numbers</span><span class="p">):</span>
<span class="s2">"Convert any array of numbers to standard units."</span>
<span class="k">return</span> <span class="p">(</span><span class="n">any_numbers</span> <span class="o">-</span> <span class="n">np</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="n">any_numbers</span><span class="p">))</span><span class="o">/</span><span class="n">np</span><span class="o">.</span><span class="n">std</span><span class="p">(</span><span class="n">any_numbers</span><span class="p">)</span>
</pre></div></div></div>
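<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>As a quick check of the function, it can be applied to the small array <code>numbers</code> from the start of this section. The block repeats the definition so that it runs on its own; the variable <code>z</code> is illustrative. After conversion, the values should have average 0 and SD 1, up to rounding.</p></div></div>

```python
import numpy as np

def standard_units(any_numbers):
    "Convert any array of numbers to standard units."
    return (any_numbers - np.mean(any_numbers)) / np.std(any_numbers)

numbers = np.array([1, 2, 2, 10])
z = standard_units(numbers)

# Below-average values become negative, above-average values positive.
print(z)
# In standard units the average is 0 and the SD is 1 (up to rounding).
print(np.mean(z), np.std(z))
```

<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>One caveat: if every entry of the array is the same, the SD is 0 and the division is not meaningful, so the function should only be used on arrays that have some variability.</p></div></div>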
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>As we saw in an earlier section, the table <code>ua</code> contains a column <code>Departure Delay</code> consisting of the departure delay times, in minutes, of over 37,000 United Airlines flights. We can create a new column called <code>z</code> by applying the function <code>standard_units</code> to the column of delay times. This allows us to see all the delay times in minutes as well as their corresponding values in standard units.</p></div></div>
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">ua</span><span class="o">.</span><span class="n">append_column</span><span class="p">(</span><span class="s1">'z'</span><span class="p">,</span> <span class="n">standard_units</span><span class="p">(</span><span class="n">ua</span><span class="o">.</span><span class="n">column</span><span class="p">(</span><span class="s1">'Departure Delay'</span><span class="p">)))</span>
<span class="n">ua</span>
</pre></div></div></div>
<div class="output_html rendered_html output_subarea output_execute_result">
<table border="1" class="dataframe">
<thead>
<tr>
<th>Departure Delay</th> <th>z</th>
</tr>
</thead>
<tbody>
<tr>
<td>-1 </td> <td>-0.391942</td>
</tr>
</tbody>
<tbody><tr>
<td>-1 </td> <td>-0.391942</td>
</tr>
</tbody>
<tbody><tr>
<td>-1 </td> <td>-0.391942</td>
</tr>
</tbody>
<tbody><tr>
<td>-1 </td> <td>-0.391942</td>
</tr>
</tbody>
<tbody><tr>
<td>-1 </td> <td>-0.391942</td>
</tr>
</tbody>
<tbody><tr>
<td>-1 </td> <td>-0.391942</td>
</tr>
</tbody>
<tbody><tr>
<td>-1 </td> <td>-0.391942</td>
</tr>
</tbody>
<tbody><tr>
<td>-1 </td> <td>-0.391942</td>
</tr>
</tbody>
<tbody><tr>
<td>-1 </td> <td>-0.391942</td>
</tr>
</tbody>
<tbody><tr>
<td>-1 </td> <td>-0.391942</td>
</tr>
</tbody>
</table>
<p>... (37353 rows omitted)</p></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Something rather alarming happens when we sort the delay times from highest to lowest. The standard units of the longest delays are extremely high!</p></div></div>
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">ua</span><span class="o">.</span><span class="n">sort</span><span class="p">(</span><span class="s1">'Departure Delay'</span><span class="p">,</span> <span class="n">descending</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
</pre></div></div></div>
<div class="output_html rendered_html output_subarea output_execute_result">
<table border="1" class="dataframe">
<thead>
<tr>
<th>Departure Delay</th> <th>z</th>
</tr>
</thead>
<tbody>
<tr>
<td>886 </td> <td>22.9631</td>
</tr>
</tbody>
<tbody><tr>
<td>739 </td> <td>19.0925</td>
</tr>
</tbody>
<tbody><tr>
<td>678 </td> <td>17.4864</td>
</tr>
</tbody>
<tbody><tr>
<td>595 </td> <td>15.301 </td>
</tr>
</tbody>
<tbody><tr>
<td>566 </td> <td>14.5374</td>
</tr>
</tbody>
<tbody><tr>
<td>558 </td> <td>14.3267</td>
</tr>
</tbody>
<tbody><tr>
<td>543 </td> <td>13.9318</td>
</tr>
</tbody>
<tbody><tr>
<td>541 </td> <td>13.8791</td>
</tr>
</tbody>
<tbody><tr>
<td>537 </td> <td>13.7738</td>
</tr>
</tbody>
<tbody><tr>
<td>513 </td> <td>13.1419</td>
</tr>
</tbody>
</table>
<p>... (37353 rows omitted)</p></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>This shows that data can be many SDs above average (and that flights can be delayed by almost 15 hours). The highest value is almost 23 in standard units.</p>
<p>However, the proportion of these extreme values is tiny, and the bounds in Chebychev's inequality still hold true. For example, let us calculate the percent of delay times that are in the range "average $\pm$ 2 SDs". This is the same as the percent of times for which the standard units are in the range (-2, 2). That is just about 96%, as computed below, consistent with Chebychev's bound of "at least 75%", which comes from $1 - 1/2^2 = 0.75$.</p></div></div>
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">within_2_sd</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">logical_and</span><span class="p">(</span><span class="n">ua</span><span class="o">.</span><span class="n">column</span><span class="p">(</span><span class="s1">'z'</span><span class="p">)</span> <span class="o">></span> <span class="o">-</span><span class="mi">2</span><span class="p">,</span> <span class="n">ua</span><span class="o">.</span><span class="n">column</span><span class="p">(</span><span class="s1">'z'</span><span class="p">)</span> <span class="o"><</span> <span class="mi">2</span><span class="p">)</span>
<span class="n">ua</span><span class="o">.</span><span class="n">where</span><span class="p">(</span><span class="n">within_2_sd</span><span class="p">)</span><span class="o">.</span><span class="n">num_rows</span><span class="o">/</span><span class="n">ua</span><span class="o">.</span><span class="n">num_rows</span>
</pre></div></div></div>
<div class="output_text output_subarea output_execute_result">
<pre>0.960522441988063</pre></div>
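<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>The same comparison can be made for other ranges. The sketch below uses synthetic, right-skewed data (<code>fake_delays</code>, an assumption made for illustration, not the United Airlines data) and two hypothetical helper functions, <code>proportion_within</code> and <code>chebychev_bound</code>. For each $z$, it prints the observed proportion within $z$ SDs of the average next to Chebychev's lower bound $1 - 1/z^2$.</p></div></div>

```python
import numpy as np

# Synthetic, right-skewed "delay" data; an assumption for illustration,
# not the United Airlines data
rng = np.random.default_rng(0)
fake_delays = rng.exponential(scale=30, size=10_000) - 5

# Convert to standard units: subtract the average, divide by the SD
fake_su = (fake_delays - np.mean(fake_delays)) / np.std(fake_delays)

def proportion_within(su, z):
    "Proportion of values strictly within z SDs of the average."
    return np.count_nonzero(np.abs(su) < z) / len(su)

def chebychev_bound(z):
    "Chebychev's lower bound on the proportion within z SDs of the average."
    return 1 - 1 / z**2

# Observed proportion next to the bound, for z = 2, 3, 4, 5
for z in np.arange(2, 6):
    print(z, proportion_within(fake_su, z), chebychev_bound(z))
```

<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Whatever the shape of the data, the observed proportion should never fall below the bound. For most data sets it is well above it.</p></div></div>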
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>The histogram of delay times is shown below, with the horizontal axis in standard units. By the table above, the right-hand tail continues all the way out to almost 23 standard units (886 minutes). The area of the histogram outside the range $z=-2$ to $z=2$ is about 4%, spread out in tiny slivers that are mostly invisible in the histogram.</p></div></div>
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">ua</span><span class="o">.</span><span class="n">hist</span><span class="p">(</span><span class="s1">'z'</span><span class="p">,</span> <span class="n">bins</span><span class="o">=</span><span class="n">np</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="o">-</span><span class="mi">5</span><span class="p">,</span> <span class="mi">24</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span> <span class="n">unit</span><span class="o">=</span><span class="s1">'standard unit'</span><span class="p">)</span>
</pre></div></div></div>
<div class="output_png output_subarea ">
<img src="../notebooks-images/Spread_58_0.png"/></div></div>
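<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>When the right-hand tail is this long, it is worth checking that the bins reach the largest value. <code>np.arange</code> excludes its stop value, so the last bin edge is one step below the stop. The sketch below uses a few made-up values in standard units (<code>example_su</code> is hypothetical) and NumPy's <code>np.histogram</code>, rather than the Table method <code>hist</code>, to count how many values land in the bins for two choices of stop value.</p></div></div>

```python
import numpy as np

# Hypothetical values in standard units, including one far out in the right tail
example_su = np.array([-0.4, -0.3, 0.1, 1.8, 22.96])

# np.arange excludes its stop value, so these bin edges end at 22
short_bins = np.arange(-5, 23, 1)
# Raising the stop value by one bin width makes the last edge 23
long_bins = np.arange(-5, 24, 1)

short_counts, _ = np.histogram(example_su, bins=short_bins)
long_counts, _ = np.histogram(example_su, bins=long_bins)

# np.histogram does not count values beyond the last bin edge
print(short_bins[-1], short_counts.sum())
print(long_bins[-1], long_counts.sum())
```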