notebooks-html/Spread.html

<div id="ipython-notebook">
            <a class="interact-button" href="http://data8.berkeley.edu/hub/interact?repo=textbook&path=notebooks/airline_ontime.csv&path=notebooks/nba2013.csv&path=notebooks/Spread.ipynb">Interact</a>
            
<script type="text/x-mathjax-config">
  MathJax.Hub.Config({
    tex2jax: {
      inlineMath: [['$','$']],
      processEscapes: true
    }
  });
</script>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="Variability">Variability<a class="anchor-link" href="#Variability">¶</a></h1></div></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h3 id="The-Rough-Size-of-Deviations-from-Average">The Rough Size of Deviations from Average<a class="anchor-link" href="#The-Rough-Size-of-Deviations-from-Average">¶</a></h3><p>As we have seen, inference based on test statistics must take into account the way in which the statistics vary across samples. It is therefore important to be able to quantify the variability in any list of numbers. One way is to create a measure of the difference between the values in the list and the average of the list.</p>
<p>We will start by defining such a measure in the context of a simple array of just four <code>numbers</code>.</p></div></div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">numbers</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">([</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">10</span><span class="p">])</span>
<span class="n">numbers</span>
</pre></div></div></div>
<div class="output_text output_subarea output_execute_result">
<pre>array([ 1,  2,  2, 10])</pre></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>The goal is to get a measure of roughly how far off the numbers in the list are from the average. To do this, we need the average, and all the deviations from the average. A "deviation from average" is just a value minus the average.</p></div></div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="c1"># Step 1. The average.</span>

<span class="n">average</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="n">numbers</span><span class="p">)</span>
<span class="n">average</span>
</pre></div></div></div>
<div class="output_text output_subarea output_execute_result">
<pre>3.75</pre></div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="c1"># Step 2. The deviations from average.</span>

<span class="n">deviations</span> <span class="o">=</span> <span class="n">numbers</span> <span class="o">-</span> <span class="n">average</span>
<span class="n">numbers_and_deviations</span> <span class="o">=</span> <span class="n">Table</span><span class="p">()</span><span class="o">.</span><span class="n">with_columns</span><span class="p">([</span>
        <span class="s1">'value'</span><span class="p">,</span> <span class="n">numbers</span><span class="p">,</span>
        <span class="s1">'deviation from average'</span><span class="p">,</span> <span class="n">deviations</span>
        <span class="p">])</span>
<span class="n">numbers_and_deviations</span>
</pre></div></div></div>
<div class="output_html rendered_html output_subarea output_execute_result">
<table border="1" class="dataframe">
    <thead>
        <tr>
            <th>value</th> <th>deviation from average</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td>1    </td> <td>-2.75                 </td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>2    </td> <td>-1.75                 </td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>2    </td> <td>-1.75                 </td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>10   </td> <td>6.25                  </td>
        </tr>
    </tbody>
</table></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Some of the deviations are negative; those correspond to values that are below average. Positive deviations correspond to above-average values.</p>
<p>To calculate roughly how big the deviations are, it is natural to compute the mean of the deviations. But something interesting happens when all the deviations are added together:</p></div></div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="nb">sum</span><span class="p">(</span><span class="n">deviations</span><span class="p">)</span>
</pre></div></div></div>
<div class="output_text output_subarea output_execute_result">
<pre>0.0</pre></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>The positive deviations exactly cancel out the negative ones. This is true of all lists of numbers, no matter what the histogram of the list looks like: the sum of the deviations from average is zero.</p>
<p>Since the sum of the deviations is 0, the mean of the deviations will be 0 as well:</p></div></div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">np</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="n">deviations</span><span class="p">)</span>
</pre></div></div></div>
<div class="output_text output_subarea output_execute_result">
<pre>0.0</pre></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Because of this, the mean of the deviations is not a useful measure of the size of the deviations. What we really want to know is roughly how big the deviations are, regardless of whether they are positive or negative. So we need a way to eliminate the signs of the deviations.</p>
<p>There are two time-honored ways of losing signs: the absolute value, and the square. It turns out that taking the square constructs a measure with extremely powerful properties, some of which we will study in this course.</p>
<p>So let us eliminate the signs by squaring all the deviations. Then we will take the mean of the squares:</p></div></div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="c1"># Step 3. The squared deviations from average</span>

<span class="n">squared_deviations</span> <span class="o">=</span> <span class="n">deviations</span> <span class="o">**</span> <span class="mi">2</span>
<span class="n">numbers_and_deviations</span><span class="o">.</span><span class="n">with_column</span><span class="p">(</span><span class="s1">'squared deviations from average'</span><span class="p">,</span> <span class="n">squared_deviations</span><span class="p">)</span>
</pre></div></div></div>
<div class="output_html rendered_html output_subarea output_execute_result">
<table border="1" class="dataframe">
    <thead>
        <tr>
            <th>value</th> <th>deviation from average</th> <th>squared deviations from average</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td>1    </td> <td>-2.75                 </td> <td>7.5625                         </td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>2    </td> <td>-1.75                 </td> <td>3.0625                         </td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>2    </td> <td>-1.75                 </td> <td>3.0625                         </td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>10   </td> <td>6.25                  </td> <td>39.0625                        </td>
        </tr>
    </tbody>
</table></div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="c1"># Variance = the mean squared deviation from average</span>

<span class="n">variance</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="n">squared_deviations</span><span class="p">)</span>
<span class="n">variance</span>
</pre></div></div></div>
<div class="output_text output_subarea output_execute_result">
<pre>13.1875</pre></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p><strong>Variance:</strong> The mean squared deviation is called the <em>variance</em> of a sequence of values. While it does give us an idea of spread, it is not on the same scale as the original variable and its units are the square of the original. This makes interpretation very difficult. So we return to the original scale by taking the positive square root of the variance:</p></div></div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="c1"># Standard Deviation:    root mean squared deviation from average</span>
<span class="c1"># Steps of calculation:   5    4      3       2             1</span>

<span class="n">sd</span> <span class="o">=</span> <span class="n">variance</span> <span class="o">**</span> <span class="mf">0.5</span>
<span class="n">sd</span>
</pre></div></div></div>
<div class="output_text output_subarea output_execute_result">
<pre>3.6314597615834874</pre></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h2 id="Standard-Deviation">Standard Deviation<a class="anchor-link" href="#Standard-Deviation">¶</a></h2><p>The quantity that we have just computed is called the <em>standard deviation</em> of the list, and is abbreviated as SD. It measures roughly how far the numbers on the list are from their average.</p>
<p><strong>Definition.</strong> The SD of a list is defined as the <em>root mean square of deviations from average</em>. That's a mouthful. But read it from right to left and you have the sequence of steps in the calculation.</p>
<p><strong>Computation.</strong> The <code>numPy</code> function <code>np.std</code> computes the SD of values in an array:</p></div></div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">np</span><span class="o">.</span><span class="n">std</span><span class="p">(</span><span class="n">numbers</span><span class="p">)</span>
</pre></div></div></div>
<div class="output_text output_subarea output_execute_result">
<pre>3.6314597615834874</pre></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h3 id="Working-with-the-SD">Working with the SD<a class="anchor-link" href="#Working-with-the-SD">¶</a></h3><p>Let us now examine the SD in the context of a more interesting dataset. The table <code>nba</code> contains data on the players in the National Basketball Association (NBA) in 2013. For each player, the table records the position at which the player usually played, his height in inches, weight in pounds, and age in years.</p></div></div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">nba</span> <span class="o">=</span> <span class="n">Table</span><span class="o">.</span><span class="n">read_table</span><span class="p">(</span><span class="s2">"nba2013.csv"</span><span class="p">)</span>
<span class="n">nba</span>
</pre></div></div></div>
<div class="output_html rendered_html output_subarea output_execute_result">
<table border="1" class="dataframe">
    <thead>
        <tr>
            <th>Name</th> <th>Position</th> <th>Height</th> <th>Weight</th> <th>Age in 2013</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td>DeQuan Jones   </td> <td>Guard   </td> <td>80    </td> <td>221   </td> <td>23         </td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>Darius Miller  </td> <td>Guard   </td> <td>80    </td> <td>235   </td> <td>23         </td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>Trevor Ariza   </td> <td>Guard   </td> <td>80    </td> <td>210   </td> <td>28         </td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>James Jones    </td> <td>Guard   </td> <td>80    </td> <td>215   </td> <td>32         </td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>Wesley Johnson </td> <td>Guard   </td> <td>79    </td> <td>215   </td> <td>26         </td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>Klay Thompson  </td> <td>Guard   </td> <td>79    </td> <td>205   </td> <td>23         </td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>Thabo Sefolosha</td> <td>Guard   </td> <td>79    </td> <td>215   </td> <td>29         </td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>Chase Budinger </td> <td>Guard   </td> <td>79    </td> <td>218   </td> <td>25         </td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>Kevin Martin   </td> <td>Guard   </td> <td>79    </td> <td>185   </td> <td>30         </td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>Evan Fournier  </td> <td>Guard   </td> <td>79    </td> <td>206   </td> <td>20         </td>
        </tr>
    </tbody>
</table>
<p>... (495 rows omitted)</p></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Here is a histogram of the players' heights.</p></div></div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">nba</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s1">'Height'</span><span class="p">)</span><span class="o">.</span><span class="n">hist</span><span class="p">(</span><span class="n">bins</span><span class="o">=</span><span class="n">np</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">68</span><span class="p">,</span> <span class="mi">88</span><span class="p">,</span> <span class="mi">1</span><span class="p">))</span>
</pre></div></div></div>
<div class="output_png output_subarea ">
<img src="../notebooks-images/Spread_22_0.png"/></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>It is no surprise that NBA players are tall! Their average height is just over 79 inches (6'7"), about 10 inches taller than the average height of men in the United States.</p></div></div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">mean_height</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="n">nba</span><span class="o">.</span><span class="n">column</span><span class="p">(</span><span class="s1">'Height'</span><span class="p">))</span>
<span class="n">mean_height</span>
</pre></div></div></div>
<div class="output_text output_subarea output_execute_result">
<pre>79.065346534653472</pre></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>About far off are the players' heights from the average? This is measured by the SD of the heights, which is about 3.45 inches.</p></div></div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">sd_height</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">std</span><span class="p">(</span><span class="n">nba</span><span class="o">.</span><span class="n">column</span><span class="p">(</span><span class="s1">'Height'</span><span class="p">))</span>
<span class="n">sd_height</span>
</pre></div></div></div>
<div class="output_text output_subarea output_execute_result">
<pre>3.4505971830275546</pre></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>The towering center Hasheem Thabeet of the Oklahoma City Thunder was the tallest player at a height of 87 inches.</p></div></div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">nba</span><span class="o">.</span><span class="n">sort</span><span class="p">(</span><span class="s1">'Height'</span><span class="p">,</span> <span class="n">descending</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span><span class="o">.</span><span class="n">show</span><span class="p">(</span><span class="mi">3</span><span class="p">)</span>
</pre></div></div></div>
<div class="output_html rendered_html output_subarea ">
<table border="1" class="dataframe">
    <thead>
        <tr>
            <th>Name</th> <th>Position</th> <th>Height</th> <th>Weight</th> <th>Age in 2013</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td>Hasheem Thabeet</td> <td>Center  </td> <td>87    </td> <td>263   </td> <td>26         </td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>Roy Hibbert    </td> <td>Center  </td> <td>86    </td> <td>278   </td> <td>26         </td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>Tyson Chandler </td> <td>Center  </td> <td>85    </td> <td>235   </td> <td>30         </td>
        </tr>
    </tbody>
</table>
<p>... (502 rows omitted)</p></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Thabeet was about 8 inches above the average height.</p></div></div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="mi">87</span> <span class="o">-</span> <span class="n">mean_height</span>
</pre></div></div></div>
<div class="output_text output_subarea output_execute_result">
<pre>7.9346534653465284</pre></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>That's a deviation from average, and it is about 2.3 times the standard deviation:</p></div></div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="p">(</span><span class="mi">87</span> <span class="o">-</span> <span class="n">mean_height</span><span class="p">)</span><span class="o">/</span><span class="n">sd_height</span>
</pre></div></div></div>
<div class="output_text output_subarea output_execute_result">
<pre>2.2995015194397923</pre></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>In other words, the height of the tallest player was about 2.3 SDs above average.</p>
<p>At 69 inches tall, Isaiah Thomas was one of the two shortest players in the NBA. His height was about 2.9 SDs below average.</p></div></div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">nba</span><span class="o">.</span><span class="n">sort</span><span class="p">(</span><span class="s1">'Height'</span><span class="p">)</span><span class="o">.</span><span class="n">show</span><span class="p">(</span><span class="mi">3</span><span class="p">)</span>
</pre></div></div></div>
<div class="output_html rendered_html output_subarea ">
<table border="1" class="dataframe">
    <thead>
        <tr>
            <th>Name</th> <th>Position</th> <th>Height</th> <th>Weight</th> <th>Age in 2013</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td>Isaiah Thomas </td> <td>Guard   </td> <td>69    </td> <td>185   </td> <td>24         </td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>Nate Robinson </td> <td>Guard   </td> <td>69    </td> <td>180   </td> <td>29         </td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>John Lucas III</td> <td>Guard   </td> <td>71    </td> <td>157   </td> <td>30         </td>
        </tr>
    </tbody>
</table>
<p>... (502 rows omitted)</p></div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="p">(</span><span class="mi">69</span> <span class="o">-</span> <span class="n">mean_height</span><span class="p">)</span><span class="o">/</span><span class="n">sd_height</span>
</pre></div></div></div>
<div class="output_text output_subarea output_execute_result">
<pre>-2.9169868288775844</pre></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>What we have observed is that the tallest and shortest players were both just a few SDs away from the average height. This is an example of why the SD is a useful measure of spread. No matter what the shape of the histogram, the average and the SD together tell you a lot about the data.</p></div></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h3 id="First-main-reason-for-measuring-spread-by-the-SD">First main reason for measuring spread by the SD<a class="anchor-link" href="#First-main-reason-for-measuring-spread-by-the-SD">¶</a></h3><p><strong>Informal statement.</strong> For all lists, the bulk of the entries are within the range "average $\pm$ a few SDs".</p>
<p>Try to resist the desire to know exactly what fuzzy words like "bulk" and "few" mean. We wil make them precise later in this section. For now, let us examine the statement in the context of some examples.</p></div></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>We have already seen that <em>all</em> of the heights of the NBA players were in the range "average $\pm$ 3 SDs".</p>
<p>What about the ages? Here is a histogram of the distribution. Yes, there was a 15-year-old player in the NBA in 2013.</p></div></div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">nba</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s1">'Age in 2013'</span><span class="p">)</span><span class="o">.</span><span class="n">hist</span><span class="p">(</span><span class="n">bins</span><span class="o">=</span><span class="n">np</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">15</span><span class="p">,</span> <span class="mi">45</span><span class="p">,</span> <span class="mi">1</span><span class="p">))</span>
</pre></div></div></div>
<div class="output_png output_subarea ">
<img src="../notebooks-images/Spread_39_0.png"/></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Juwan Howard was the oldest player, at 40. His age was about 3.2 SDs above average.</p></div></div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">nba</span><span class="o">.</span><span class="n">sort</span><span class="p">(</span><span class="s1">'Age in 2013'</span><span class="p">,</span> <span class="n">descending</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span><span class="o">.</span><span class="n">show</span><span class="p">(</span><span class="mi">3</span><span class="p">)</span>
</pre></div></div></div>
<div class="output_html rendered_html output_subarea ">
<table border="1" class="dataframe">
    <thead>
        <tr>
            <th>Name</th> <th>Position</th> <th>Height</th> <th>Weight</th> <th>Age in 2013</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td>Juwan Howard</td> <td>Forward </td> <td>81    </td> <td>250   </td> <td>40         </td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>Marcus Camby</td> <td>Center  </td> <td>83    </td> <td>235   </td> <td>39         </td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>Derek Fisher</td> <td>Guard   </td> <td>73    </td> <td>210   </td> <td>39         </td>
        </tr>
    </tbody>
</table>
<p>... (502 rows omitted)</p></div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">ages</span> <span class="o">=</span> <span class="n">nba</span><span class="p">[</span><span class="s1">'Age in 2013'</span><span class="p">]</span>
<span class="p">(</span><span class="mi">40</span> <span class="o">-</span> <span class="n">np</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="n">ages</span><span class="p">))</span><span class="o">/</span><span class="n">np</span><span class="o">.</span><span class="n">std</span><span class="p">(</span><span class="n">ages</span><span class="p">)</span>
</pre></div></div></div>
<div class="output_text output_subarea output_execute_result">
<pre>3.1958482778922357</pre></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>The youngest was 15-year-old Jarvis Varnado, who won the NBA Championship that year with the Miami Heat. His age was about 2.6 SDs below average.</p></div></div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">nba</span><span class="o">.</span><span class="n">sort</span><span class="p">(</span><span class="s1">'Age in 2013'</span><span class="p">)</span><span class="o">.</span><span class="n">show</span><span class="p">(</span><span class="mi">3</span><span class="p">)</span>
</pre></div></div></div>
<div class="output_html rendered_html output_subarea ">
<table border="1" class="dataframe">
    <thead>
        <tr>
            <th>Name</th> <th>Position</th> <th>Height</th> <th>Weight</th> <th>Age in 2013</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td>Jarvis Varnado       </td> <td>Forward </td> <td>81    </td> <td>230   </td> <td>15         </td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>Giannis Antetokounmpo</td> <td>Forward </td> <td>81    </td> <td>205   </td> <td>18         </td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>Sergey Karasev       </td> <td>Guard   </td> <td>79    </td> <td>197   </td> <td>19         </td>
        </tr>
    </tbody>
</table>
<p>... (502 rows omitted)</p></div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="p">(</span><span class="mi">15</span> <span class="o">-</span> <span class="n">np</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="n">ages</span><span class="p">))</span><span class="o">/</span><span class="n">np</span><span class="o">.</span><span class="n">std</span><span class="p">(</span><span class="n">ages</span><span class="p">)</span>
</pre></div></div></div>
<div class="output_text output_subarea output_execute_result">
<pre>-2.5895811038670811</pre></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>What we have observed for the heights and ages is true in great generality. For all lists, the bulk of the entries are no more than 2 or 3 SDs away from the average.</p>
<p>These rough statements about deviations from average are made precise by the following result.</p></div></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h3 id="Chebychev's-inequality">Chebychev's inequality<a class="anchor-link" href="#Chebychev's-inequality">¶</a></h3><p>For all lists, and all positive numbers <em>z</em>, the proportion of entries that are in the range
"average $\pm z$ SDs" is <strong>at least</strong> $1 - \frac{1}{z^2}$.</p></div></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>What makes this result powerful is that it is true for all lists – all distributions, no matter how irregular.</p>
<p>Specifically, Chebychev's inequality says that for every list:</p>
<ul>
<li><p>the proportion in the range "average $\pm$ 2 SDs" is at least 1 - 1/4 = 0.75</p>
</li>
<li><p>the proportion in the range "average $\pm$ 3 SDs" is at least 1 - 1/9 $\approx$ 0.89</p>
</li>
<li><p>the proportion in the range "average $\pm$ 4.5 SDs" is at least 1 - 1/$4.5^2$ $\approx$ 0.95</p>
</li>
</ul>
<p>Chebychev's Inequality gives a lower bound, not an exact answer or an approximation. For example, the percent of entries in the range "average $\pm ~2$ SDs" might be quite a bit larger than 75%. But it cannot be smaller.</p></div></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h3 id="Standard-units">Standard units<a class="anchor-link" href="#Standard-units">¶</a></h3><p>In the calculations above, $z$ measures <em>standard units</em>, the number of standard deviations above average.</p>
<p>Some values of standard units are negative, corresponding to original values that are below average. Other values of standard units are positive. But no matter what the distribution of the list looks like, Chebychev's Inequality says that standard units will typically be in the (-5, 5) range.</p>
<p>To convert a value to standard units, first find how far it is from average, and then compare that deviation with the standard deviation.
$$
z ~=~ \frac{\mbox{value }-\mbox{ average}}{\mbox{SD}}
$$</p>
<p>As we will see, standard units are frequently used in data analysis. So it is useful to define a function that converts an array of numbers to standard units.</p></div></div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="k">def</span> <span class="nf">standard_units</span><span class="p">(</span><span class="n">any_numbers</span><span class="p">):</span>
    <span class="s2">"Convert any array of numbers to standard units."</span>
    <span class="k">return</span> <span class="p">(</span><span class="n">any_numbers</span> <span class="o">-</span> <span class="n">np</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="n">any_numbers</span><span class="p">))</span><span class="o">/</span><span class="n">np</span><span class="o">.</span><span class="n">std</span><span class="p">(</span><span class="n">any_numbers</span><span class="p">)</span>    
</pre></div></div></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>As we saw in an earlier section, the table <code>ua</code> contains a column <code>Departure Delay</code> consisting of the departure delay times, in minutes, of over 37,000 United Airlines flights. We can create a new column called <code>Delay_su</code> by applying the function <code>standard_units</code> to the column of delay times. This allows us to see all the delay times in minutes as well as their corresponding values in standard units.</p></div></div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">ua</span><span class="o">.</span><span class="n">append_column</span><span class="p">(</span><span class="s1">'z'</span><span class="p">,</span> <span class="n">standard_units</span><span class="p">(</span><span class="n">ua</span><span class="o">.</span><span class="n">column</span><span class="p">(</span><span class="s1">'Departure Delay'</span><span class="p">)))</span>
<span class="n">ua</span>
</pre></div></div></div>
<div class="output_html rendered_html output_subarea output_execute_result">
<table border="1" class="dataframe">
    <thead>
        <tr>
            <th>Departure Delay</th> <th>z</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td>-1             </td> <td>-0.391942</td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>-1             </td> <td>-0.391942</td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>-1             </td> <td>-0.391942</td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>-1             </td> <td>-0.391942</td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>-1             </td> <td>-0.391942</td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>-1             </td> <td>-0.391942</td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>-1             </td> <td>-0.391942</td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>-1             </td> <td>-0.391942</td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>-1             </td> <td>-0.391942</td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>-1             </td> <td>-0.391942</td>
        </tr>
    </tbody>
</table>
<p>... (37353 rows omitted)</p></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Something rather alarming happens when we sort the delay times from highest to lowest. The standard units are extremely high!</p></div></div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">ua</span><span class="o">.</span><span class="n">sort</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">descending</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
</pre></div></div></div>
<div class="output_html rendered_html output_subarea output_execute_result">
<table border="1" class="dataframe">
    <thead>
        <tr>
            <th>Departure Delay</th> <th>z</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td>886            </td> <td>22.9631</td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>739            </td> <td>19.0925</td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>678            </td> <td>17.4864</td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>595            </td> <td>15.301 </td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>566            </td> <td>14.5374</td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>558            </td> <td>14.3267</td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>543            </td> <td>13.9318</td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>541            </td> <td>13.8791</td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>537            </td> <td>13.7738</td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>513            </td> <td>13.1419</td>
        </tr>
    </tbody>
</table>
<p>... (37353 rows omitted)</p></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>What this shows is that it is possible for data to be many SDs above average (and for flights to be delayed by almost 15 hours). The highest value is more than 22 in standard units.</p>
<p>However, the proportion of these extreme values is tiny, and the bounds in Chebychev's inequality still hold true. For example, let us calculate the percent of delay times that are in the range "average $\pm$ 2 SDs". This is the same as the percent of times for which the standard units are in the range (-2, 2). That is just about 96%, as computed below, consistent with Chebychev's bound of "at least 75%".</p></div></div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">within_2_sd</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">logical_and</span><span class="p">(</span><span class="n">ua</span><span class="o">.</span><span class="n">column</span><span class="p">(</span><span class="s1">'z'</span><span class="p">)</span> <span class="o">&gt;</span> <span class="o">-</span><span class="mi">2</span><span class="p">,</span> <span class="n">ua</span><span class="o">.</span><span class="n">column</span><span class="p">(</span><span class="s1">'z'</span><span class="p">)</span> <span class="o">&lt;</span> <span class="mi">2</span><span class="p">)</span>
<span class="n">ua</span><span class="o">.</span><span class="n">where</span><span class="p">(</span><span class="n">within_2_sd</span><span class="p">)</span><span class="o">.</span><span class="n">num_rows</span><span class="o">/</span><span class="n">ua</span><span class="o">.</span><span class="n">num_rows</span>
</pre></div></div></div>
<div class="output_text output_subarea output_execute_result">
<pre>0.960522441988063</pre></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>The histogram of delay times is shown below, with the horizontal axis in standard units. By the table above, the right hand tail continues all the way out to 22 standard units (886 minutes). The area of the histogram outside the range $z=-2$ to $z=2$ is about 4%, put together in tiny little bits that are mostly invisible in a histogram.</p></div></div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">ua</span><span class="o">.</span><span class="n">hist</span><span class="p">(</span><span class="s1">'z'</span><span class="p">,</span> <span class="n">bins</span><span class="o">=</span><span class="n">np</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="o">-</span><span class="mi">5</span><span class="p">,</span> <span class="mi">23</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span> <span class="n">unit</span><span class="o">=</span><span class="s1">'standard unit'</span><span class="p">)</span>
</pre></div></div></div>
<div class="output_png output_subarea ">
<img src="../notebooks-images/Spread_58_0.png"/></div></div>