notebooks-html/Residuals_Sp16.html

<div id="ipython-notebook">
            <a class="interact-button" href="http://datahub.berkeley.edu/user-redirect/interact?repo=textbook&path=notebooks/baby.csv&path=notebooks/us_women.csv&path=notebooks/Residuals_Sp16.ipynb">Interact</a>
            
<script type="text/x-mathjax-config">
  MathJax.Hub.Config({
    tex2jax: {
      inlineMath: [['$','$']],
      processEscapes: true
    }
  });
</script>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>The methods that we have developed for inference in regression are valid if the regression model holds for our data. If the data do not satisfy the assumptions of the model, then it can be hard to justify calculations that assume that the model is good.</p>
<p>In this section we will develop some simple descriptive methods that can help us decide whether our data satisfy the regression model.</p>
<p>We will start with a review of the details of the model, which specifies that the points in the scatter plot are generated at random as follows.</p>
<ul>
<li>Tyche creates the scatter plot by starting with points that lie perfectly on a straight line and then pushing them off the line vertically, either above or below, as follows:<ul>
<li>For each $x$, Tyche finds the corresponding point on the true line, and then adds an error.</li>
<li>The errors are drawn at random with replacement from a population of errors that has a normal distribution with mean 0 and an unknown standard deviation.</li>
<li>Tyche creates a point whose horizontal coordinate is $x$ and whose vertical coordinate is "the height of the true line at $x$, plus the error". If the error is positive, the point is above the line. If the error is negative, the point is below the line.</li>
</ul>
</li>
<li>We get to see the resulting points but not the true line on which the points started out nor the random errors that Tyche added.</li>
</ul></div></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h3 id="First-Diagnostic:-The-Scatter-Plot">First Diagnostic: The Scatter Plot<a class="anchor-link" href="#First-Diagnostic:-The-Scatter-Plot">¶</a></h3><p>It is always helpful to start with a visualization. Continuing the example of estimating birth weight based on gestational days, let us look at a scatter plot of the two variables.</p></div></div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">baby</span><span class="o">.</span><span class="n">scatter</span><span class="p">(</span><span class="s1">'Gestational Days'</span><span class="p">,</span> <span class="s1">'Birth Weight'</span><span class="p">)</span>
</pre></div></div></div>
<div class="output_png output_subarea ">
<img src="/notebooks-images/Residuals_Sp16_5_0.png"/></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>The points do appear to be roughly clustered around a straight line. There is no strikingly noticeable curve in the scatter.</p>
<p>A plot of the regression line through the scatter plot shows that about half the points are above the line and half are below the line almost symmetrically, supporting the model's assumption that Tyche is adding random errors that are normally distributed with mean 0.</p></div></div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">scatter_fit</span><span class="p">(</span><span class="n">baby</span><span class="p">,</span> <span class="s1">'Gestational Days'</span><span class="p">,</span> <span class="s1">'Birth Weight'</span><span class="p">)</span>
</pre></div></div></div>
<div class="output_png output_subarea ">
<img src="/notebooks-images/Residuals_Sp16_7_0.png"/></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h3 id="Second-Diagnostic:-A-Residual-Plot">Second Diagnostic: A Residual Plot<a class="anchor-link" href="#Second-Diagnostic:-A-Residual-Plot">¶</a></h3><p>We cannot see the true line that is the basis of the regression model, but the regression line that we calculate acts as a substitute for it. So also the vertical distances of the points from the regression line act as substitutes for the random errors that Tyche uses to push the points vertically off the true line.</p>
<p>As we have seen earlier, the vertical distances of the points from the regression line are called <em>residuals</em>. There is one residual for each point in the scatter plot. The residual is the difference between the observed value of $y$ and the fitted value of $y$:</p>
<p>For the point $(x, y)$,</p>
$$
\mbox{residual} ~~ = ~~ y ~-~
\mbox{fitted value of }y
~~ = ~~ y ~-~
\mbox{height of regression line at }y
$$<p>The table <code>baby_regression</code> contains the data, the fitted values and the residuals:</p></div></div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">baby2</span> <span class="o">=</span> <span class="n">baby</span><span class="o">.</span><span class="n">select</span><span class="p">([</span><span class="s1">'Gestational Days'</span><span class="p">,</span> <span class="s1">'Birth Weight'</span><span class="p">])</span>
<span class="n">fitted</span> <span class="o">=</span> <span class="n">fit</span><span class="p">(</span><span class="n">baby</span><span class="p">,</span> <span class="s1">'Gestational Days'</span><span class="p">,</span> <span class="s1">'Birth Weight'</span><span class="p">)</span>
<span class="n">residuals</span> <span class="o">=</span> <span class="n">baby</span><span class="o">.</span><span class="n">column</span><span class="p">(</span><span class="s1">'Birth Weight'</span><span class="p">)</span> <span class="o">-</span> <span class="n">fitted</span>
<span class="n">baby_regression</span> <span class="o">=</span> <span class="n">baby2</span><span class="o">.</span><span class="n">with_columns</span><span class="p">([</span>
        <span class="s1">'Fitted Value'</span><span class="p">,</span> <span class="n">fitted</span><span class="p">,</span>
        <span class="s1">'Residual'</span><span class="p">,</span> <span class="n">residuals</span>
    <span class="p">])</span>
<span class="n">baby_regression</span>
</pre></div></div></div>
<div class="output_html rendered_html output_subarea output_execute_result">
<table border="1" class="dataframe">
    <thead>
        <tr>
            <th>Gestational Days</th> <th>Birth Weight</th> <th>Fitted Value</th> <th>Residual</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td>284             </td> <td>120         </td> <td>121.748     </td> <td>-1.74801</td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>282             </td> <td>113         </td> <td>120.815     </td> <td>-7.8149 </td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>279             </td> <td>128         </td> <td>119.415     </td> <td>8.58477 </td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>282             </td> <td>108         </td> <td>120.815     </td> <td>-12.8149</td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>286             </td> <td>136         </td> <td>122.681     </td> <td>13.3189 </td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>244             </td> <td>138         </td> <td>103.086     </td> <td>34.9143 </td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>245             </td> <td>132         </td> <td>103.552     </td> <td>28.4477 </td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>289             </td> <td>120         </td> <td>124.081     </td> <td>-4.0808 </td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>299             </td> <td>143         </td> <td>128.746     </td> <td>14.2536 </td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>351             </td> <td>140         </td> <td>153.007     </td> <td>-13.0073</td>
        </tr>
    </tbody>
</table>
<p>... (1164 rows omitted)</p></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Once again, it helps to visualize. The figure below shows four of the residuals; for each of the four points, the length of the red segment is the residual for that point. There is one such residual for each of the other 1170 points as well.</p></div></div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">points</span> <span class="o">=</span> <span class="n">Table</span><span class="p">()</span><span class="o">.</span><span class="n">with_columns</span><span class="p">([</span>
        <span class="s1">'Gestational Days'</span><span class="p">,</span> <span class="p">[</span><span class="mi">148</span><span class="p">,</span> <span class="mi">244</span><span class="p">,</span> <span class="mi">338</span><span class="p">,</span> <span class="mi">351</span><span class="p">],</span>
        <span class="s1">'Birth Weight'</span><span class="p">,</span> <span class="p">[</span><span class="mi">116</span><span class="p">,</span> <span class="mi">138</span><span class="p">,</span> <span class="mi">102</span><span class="p">,</span> <span class="mi">140</span><span class="p">]</span>
    <span class="p">])</span>

<span class="n">scatter_fit</span><span class="p">(</span><span class="n">baby</span><span class="p">,</span> <span class="s1">'Gestational Days'</span><span class="p">,</span> <span class="s1">'Birth Weight'</span><span class="p">)</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">4</span><span class="p">):</span>
    <span class="n">f</span> <span class="o">=</span> <span class="n">fitted_value</span><span class="p">(</span><span class="n">baby</span><span class="p">,</span> <span class="s1">'Gestational Days'</span><span class="p">,</span> <span class="s1">'Birth Weight'</span><span class="p">,</span> <span class="n">points</span><span class="o">.</span><span class="n">column</span><span class="p">(</span><span class="mi">0</span><span class="p">)[</span><span class="n">i</span><span class="p">])</span>
    <span class="n">plots</span><span class="o">.</span><span class="n">plot</span><span class="p">([</span><span class="n">points</span><span class="o">.</span><span class="n">column</span><span class="p">(</span><span class="mi">0</span><span class="p">)[</span><span class="n">i</span><span class="p">]]</span><span class="o">*</span><span class="mi">2</span><span class="p">,</span> <span class="p">[</span><span class="n">f</span><span class="p">,</span> <span class="n">points</span><span class="o">.</span><span class="n">column</span><span class="p">(</span><span class="mi">1</span><span class="p">)[</span><span class="n">i</span><span class="p">]],</span> <span class="n">color</span><span class="o">=</span><span class="s1">'red'</span><span class="p">,</span> <span class="n">lw</span><span class="o">=</span><span class="mi">2</span><span class="p">)</span>
</pre></div></div></div>
<div class="output_png output_subarea ">
<img src="/notebooks-images/Residuals_Sp16_11_0.png"/></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>According to the regression model, Tyche generates errors by drawing at random with replacement from a population of errors that is normally distributed with mean 0. To see whether this looks plausible for our data, we will examine the residuals, which are our substitutes for Tyche's errors.</p>
<p>A <em>residual plot</em> can be drawn by plotting the residuals against $x$. The function <code>residual_plot</code> does just that. The function <code>regression_diagnostic_plots</code> draws the origbinal scatter plot as well as the residual plot for ease of comparison.</p></div></div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="k">def</span> <span class="nf">residual_plot</span><span class="p">(</span><span class="n">table</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">):</span>
    <span class="n">fitted</span> <span class="o">=</span> <span class="n">fit</span><span class="p">(</span><span class="n">table</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span>
    <span class="n">residuals</span> <span class="o">=</span> <span class="n">table</span><span class="p">[</span><span class="n">y</span><span class="p">]</span> <span class="o">-</span> <span class="n">fitted</span>
    <span class="n">t</span> <span class="o">=</span> <span class="n">Table</span><span class="p">()</span><span class="o">.</span><span class="n">with_columns</span><span class="p">([</span>
            <span class="n">x</span><span class="p">,</span> <span class="n">table</span><span class="p">[</span><span class="n">x</span><span class="p">],</span>
            <span class="s1">'residuals'</span><span class="p">,</span> <span class="n">residuals</span>
        <span class="p">])</span>
    <span class="n">t</span><span class="o">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="s1">'residuals'</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s1">'r'</span><span class="p">)</span>
    <span class="n">xlims</span> <span class="o">=</span> <span class="p">[</span><span class="nb">min</span><span class="p">(</span><span class="n">table</span><span class="p">[</span><span class="n">x</span><span class="p">]),</span> <span class="nb">max</span><span class="p">(</span><span class="n">table</span><span class="p">[</span><span class="n">x</span><span class="p">])]</span>
    <span class="n">plots</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">xlims</span><span class="p">,</span> <span class="p">[</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">],</span> <span class="n">color</span><span class="o">=</span><span class="s1">'darkblue'</span><span class="p">,</span> <span class="n">lw</span><span class="o">=</span><span class="mi">2</span><span class="p">)</span>
    <span class="n">plots</span><span class="o">.</span><span class="n">title</span><span class="p">(</span><span class="s1">'Residual Plot'</span><span class="p">)</span>
</pre></div></div></div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="k">def</span> <span class="nf">regression_diagnostic_plots</span><span class="p">(</span><span class="n">table</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">):</span>
    <span class="n">scatter_fit</span><span class="p">(</span><span class="n">table</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span>
    <span class="n">residual_plot</span><span class="p">(</span><span class="n">table</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span>
</pre></div></div></div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">regression_diagnostic_plots</span><span class="p">(</span><span class="n">baby</span><span class="p">,</span> <span class="s1">'Gestational Days'</span><span class="p">,</span> <span class="s1">'Birth Weight'</span><span class="p">)</span>
</pre></div></div></div>
<div class="output_png output_subarea ">
<img src="/notebooks-images/Residuals_Sp16_15_0.png"/></div>
<div class="output_png output_subarea ">
<img src="/notebooks-images/Residuals_Sp16_15_1.png"/></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>The height of each red point in the plot above is a residual. Notice how the residuals are distributed fairly symmetrically above and below the horizontal line at 0. Notice also that the vertical spread of the plot is fairly even across the most common values of $x$, in about the 200-300 range of gestational days. In other words, apart from a few outlying points, the plot isn't narrower in some places and wider in others.</p>
<p>All of this is consistent with the model's assumptions that the errors are drawn at random with replacement from a normally distributed population with mean 0. Indeed, a histogram of the residuals is centered at 0 and looks roughly bell shaped.</p></div></div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">baby_regression</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s1">'Residual'</span><span class="p">)</span><span class="o">.</span><span class="n">hist</span><span class="p">()</span>
</pre></div></div></div>
<div class="output_png output_subarea ">
<img src="/notebooks-images/Residuals_Sp16_17_0.png"/></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>These observations show us that the data are fairly consistent with the assumptions of the regression model. Therefore it makes sense to do inference based on the model, as we did in the previous section.</p></div></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h3 id="Using-the-Residual-Plot-to-Detect-Nonlinearity">Using the Residual Plot to Detect Nonlinearity<a class="anchor-link" href="#Using-the-Residual-Plot-to-Detect-Nonlinearity">¶</a></h3><p>Residual plots can be used to detect whether the regression model does not hold. The most serious departure from the model is non-linearity.</p>
<p>If two variables have a non-linear relation, the non-linearity is sometimes visible in the scatter plot. Often, however, it is easier to spot non-linearity in a residual plot than in the original scatter plot. This is usually because of the scales of the two plots: the residual plot allows us to zoom in on the errors and hence makes it easier to spot patterns.</p></div></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p><img src="https://upload.wikimedia.org/wikipedia/commons/7/75/Dugong_dugon.jpg"/></p>
<p>As an example, we will study a <a href="http://www.statsci.org/data/oz/dugongs.html">dataset</a> on the age and length of dugongs, which are marine mammals related to manatees and sea cows. The data are in a table called <code>dugong</code>. Age is measured in years and length in meters. The ages are estimated; most dugongs don't keep track of their birthdays.</p></div></div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">dugong</span> <span class="o">=</span> <span class="n">Table</span><span class="o">.</span><span class="n">read_table</span><span class="p">(</span><span class="s1">'http://www.statsci.org/data/oz/dugongs.txt'</span><span class="p">)</span>
<span class="n">dugong</span>
</pre></div></div></div>
<div class="output_html rendered_html output_subarea output_execute_result">
<table border="1" class="dataframe">
    <thead>
        <tr>
            <th>Age</th> <th>Length</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td>1   </td> <td>1.8   </td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>1.5 </td> <td>1.85  </td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>1.5 </td> <td>1.87  </td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>1.5 </td> <td>1.77  </td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>2.5 </td> <td>2.02  </td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>4   </td> <td>2.27  </td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>5   </td> <td>2.15  </td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>5   </td> <td>2.26  </td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>7   </td> <td>2.35  </td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>8   </td> <td>2.47  </td>
        </tr>
    </tbody>
</table>
<p>... (17 rows omitted)</p></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Here is a regression of length (the response) on age (the predictor). The correlation between the two variables is about 0.83, which is quite high.</p></div></div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">correlation</span><span class="p">(</span><span class="n">dugong</span><span class="p">,</span> <span class="s1">'Age'</span><span class="p">,</span> <span class="s1">'Length'</span><span class="p">)</span>
</pre></div></div></div>
<div class="output_text output_subarea output_execute_result">
<pre>0.82964745549057139</pre></div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">scatter_fit</span><span class="p">(</span><span class="n">dugong</span><span class="p">,</span> <span class="s1">'Age'</span><span class="p">,</span> <span class="s1">'Length'</span><span class="p">)</span>
</pre></div></div></div>
<div class="output_png output_subarea ">
<img src="/notebooks-images/Residuals_Sp16_24_0.png"/></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>High correlation notwithstanding, the plot shows a curved pattern that is much more visible in the residual plot.</p></div></div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">regression_diagnostic_plots</span><span class="p">(</span><span class="n">dugong</span><span class="p">,</span> <span class="s1">'Age'</span><span class="p">,</span> <span class="s1">'Length'</span><span class="p">)</span>
</pre></div></div></div>
<div class="output_png output_subarea ">
<img src="/notebooks-images/Residuals_Sp16_26_0.png"/></div>
<div class="output_png output_subarea ">
<img src="/notebooks-images/Residuals_Sp16_26_1.png"/></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>At the low end of ages, the residuals are almost all negative; then they are almost all positive; then negative again at the high end of ages. Such a discernible pattern would have been unlikely had Tyche been picking errors at random with replacement and using them to push points off a straight line. Therefore the regression model does not hold for these data.</p></div></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>As another example, the dataset <code>us_women</code> contains the average weights of U.S. women of different heights in the 1970s. Weights are measured in pounds and heights in inches. The correlation is spectacularly high – more than 0.99.</p></div></div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">us_women</span> <span class="o">=</span> <span class="n">Table</span><span class="o">.</span><span class="n">read_table</span><span class="p">(</span><span class="s1">'us_women.csv'</span><span class="p">)</span>
<span class="n">correlation</span><span class="p">(</span><span class="n">us_women</span><span class="p">,</span> <span class="s1">'height'</span><span class="p">,</span> <span class="s1">'ave weight'</span><span class="p">)</span>
</pre></div></div></div>
<div class="output_text output_subarea output_execute_result">
<pre>0.99549476778421608</pre></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>The points seem to hug the regression line pretty closely, but the residual plot flashes a warning sign: it is U-shaped, indicating that the relation between the two variables is not linear.</p></div></div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">regression_diagnostic_plots</span><span class="p">(</span><span class="n">us_women</span><span class="p">,</span> <span class="s1">'height'</span><span class="p">,</span> <span class="s1">'ave weight'</span><span class="p">)</span>
</pre></div></div></div>
<div class="output_png output_subarea ">
<img src="/notebooks-images/Residuals_Sp16_31_0.png"/></div>
<div class="output_png output_subarea ">
<img src="/notebooks-images/Residuals_Sp16_31_1.png"/></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>These examples demonstrate that even when correlation is high, fitting a straight line might not be the right thing to do. Residual plots help us decide whether or not to use linear regression.</p></div></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h3 id="Using-the-Residual-Plot-to-Detect-Heteroscedasticity">Using the Residual Plot to Detect Heteroscedasticity<a class="anchor-link" href="#Using-the-Residual-Plot-to-Detect-Heteroscedasticity">¶</a></h3></div></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p><em>Heteroscedasticity</em> is a word that will surely be of interest to those who are preparing for Spelling Bees. For data scientists, its interest lies in its meaning, which is "uneven spread".</p>
<p>Recall the table <code>hybrid</code> that contains data on hybrid cars in the U.S. Here is a regression of fuel efficiency on the rate of acceleration. The association is negative: cars that accelearate quickly tend to be less efficient.</p></div></div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">regression_diagnostic_plots</span><span class="p">(</span><span class="n">hybrid</span><span class="p">,</span> <span class="s1">'acceleration'</span><span class="p">,</span> <span class="s1">'mpg'</span><span class="p">)</span>
</pre></div></div></div>
<div class="output_png output_subarea ">
<img src="/notebooks-images/Residuals_Sp16_35_0.png"/></div>
<div class="output_png output_subarea ">
<img src="/notebooks-images/Residuals_Sp16_35_1.png"/></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Notice how the residual plot flares out towards the low end of the accelerations. In other words, the variability in the size of the errors is greater for low values of acceleration than for high values. Such a pattern would be unlikely under the regression model, because the model assumes that Tyche picks the errors at random with replacement <em>from the same normally distributed population</em>.</p>
<p>Uneven vertical variation in a residual plot can indicate that the regression model does not hold. Uneven variation is often more easily noticed in a residual plot than in the original scatter plot.</p></div></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h3 id="The-Accuracy-of-the-Regression-Estimate">The Accuracy of the Regression Estimate<a class="anchor-link" href="#The-Accuracy-of-the-Regression-Estimate">¶</a></h3><p>We will end our discussion of regression by pointing out some interesting mathematical facts that lead to a measure of the accuracy of regression. We will leave the proofs to another course, but we will demonstrate the facts by examples. Note that all of the facts hold for all scatter plots, whether or not the regression model is good.</p>
<p><strong>Fact 1.</strong>
No matter what the shape of the scatter diagram, the residuals have an average of 0.</p>
<p>In all the residual plots above, you have seen the horizontal line at 0 going through the center of the plot. That is a visualization of this fact.</p>
<p>As a numerical example, here is the average of the residuals in the regression of birth weight on gestational days of babies:</p></div></div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">np</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="n">baby_regression</span><span class="o">.</span><span class="n">column</span><span class="p">(</span><span class="s1">'Residual'</span><span class="p">))</span>
</pre></div></div></div>
<div class="output_text output_subarea output_execute_result">
<pre>-6.3912532279613784e-15</pre></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>That doesn't look like 0, but in fact it is a very small number that is 0 apart from rounding error.</p>
<p>Here is the average of the residuals in the regression of the length of dugongs on their age:</p></div></div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">dugong_residuals</span> <span class="o">=</span> <span class="n">dugong</span><span class="o">.</span><span class="n">column</span><span class="p">(</span><span class="s1">'Length'</span><span class="p">)</span> <span class="o">-</span> <span class="n">fit</span><span class="p">(</span><span class="n">dugong</span><span class="p">,</span> <span class="s1">'Age'</span><span class="p">,</span> <span class="s1">'Length'</span><span class="p">)</span>
<span class="n">np</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="n">dugong_residuals</span><span class="p">)</span>
</pre></div></div></div>
<div class="output_text output_subarea output_execute_result">
<pre>-2.8783559897689243e-16</pre></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Once again the mean of the residuals is 0 apart from rounding error.</p>
<p>This is analogous to the fact that if you take any list of numbers and calculate the list of deviations from average, the average of the deviations is 0.</p></div></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p><strong>Fact 2.</strong> No matter what the shape of the scatter diagram, the SD of the fitted values is $|r|$ times the SD of the observed values of $y$.</p>
<p>To understand this result, it is worth noting that the fitted values are all on the regression line whereas the observed values of $y$ are the heights of all the points in the scatter plot and are more variable.</p></div></div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">scatter_fit</span><span class="p">(</span><span class="n">baby</span><span class="p">,</span> <span class="s1">'Gestational Days'</span><span class="p">,</span> <span class="s1">'Birth Weight'</span><span class="p">)</span>
</pre></div></div></div>
<div class="output_png output_subarea ">
<img src="/notebooks-images/Residuals_Sp16_43_0.png"/></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>The fitted values of birth weight are, therefore, less variable than the observed values of birth weight. But how much less variable?</p>
<p>The scatter plot shows a correlation of about 0.4:</p></div></div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">correlation</span><span class="p">(</span><span class="n">baby</span><span class="p">,</span> <span class="s1">'Gestational Days'</span><span class="p">,</span> <span class="s1">'Birth Weight'</span><span class="p">)</span>
</pre></div></div></div>
<div class="output_text output_subarea output_execute_result">
<pre>0.40754279338885108</pre></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Here is ratio of the SD of the fitted values and the SD of the observed values of birth weight:</p></div></div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">fitted_birth_weight</span> <span class="o">=</span> <span class="n">baby_regression</span><span class="o">.</span><span class="n">column</span><span class="p">(</span><span class="s1">'Fitted Value'</span><span class="p">)</span>
<span class="n">observed_birth_weight</span> <span class="o">=</span> <span class="n">baby_regression</span><span class="o">.</span><span class="n">column</span><span class="p">(</span><span class="s1">'Birth Weight'</span><span class="p">)</span>
<span class="n">np</span><span class="o">.</span><span class="n">std</span><span class="p">(</span><span class="n">fitted_birth_weight</span><span class="p">)</span><span class="o">/</span><span class="n">np</span><span class="o">.</span><span class="n">std</span><span class="p">(</span><span class="n">observed_birth_weight</span><span class="p">)</span>
</pre></div></div></div>
<div class="output_text output_subarea output_execute_result">
<pre>0.40754279338885108</pre></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>The ratio is $r$. Thus the SD of the fitted values is $r$ times the SD of the observed values of $y$.</p>
<p>The example of fuel efficiency and acceleration is one where the correlation is negative:</p></div></div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">correlation</span><span class="p">(</span><span class="n">hybrid</span><span class="p">,</span> <span class="s1">'acceleration'</span><span class="p">,</span> <span class="s1">'mpg'</span><span class="p">)</span>
</pre></div></div></div>
<div class="output_text output_subarea output_execute_result">
<pre>-0.5060703843771186</pre></div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">fitted_mpg</span> <span class="o">=</span> <span class="n">fit</span><span class="p">(</span><span class="n">hybrid</span><span class="p">,</span> <span class="s1">'acceleration'</span><span class="p">,</span> <span class="s1">'mpg'</span><span class="p">)</span>
<span class="n">observed_mpg</span> <span class="o">=</span> <span class="n">hybrid</span><span class="o">.</span><span class="n">column</span><span class="p">(</span><span class="s1">'mpg'</span><span class="p">)</span>
<span class="n">np</span><span class="o">.</span><span class="n">std</span><span class="p">(</span><span class="n">fitted_mpg</span><span class="p">)</span><span class="o">/</span><span class="n">np</span><span class="o">.</span><span class="n">std</span><span class="p">(</span><span class="n">observed_mpg</span><span class="p">)</span>
</pre></div></div></div>
<div class="output_text output_subarea output_execute_result">
<pre>0.5060703843771186</pre></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>The ratio of the two SDs is $|r|$. Thus the SD of the fitted values is $|r|$ times the SD of the observed values of fuel efficiency.</p></div></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>The relation says that the closer the scatter diagram is to a straight line, the closer the fitted values will be to the observed values of $y$, and therefore the closer the variance of the fitted values will be to the variance of the observed values.</p>
<p>Therefore, the closer the scatter diagram is to a straight line, the closer $r^2$ will be to 1, and the closer $r$ will be to 1 or -1.</p></div></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p><strong>Fact 3.</strong> No matter what the shape of the scatter diagram, the SD of the residuals is $\sqrt{1-r^2}$ times the SD of the observed values of $y$. In other words, the root mean squared error of regression is $\sqrt{1-r^2}$ times the SD of $y$.</p>
<p>This fact gives a measure of the accuracy of the regression estimate.</p></div></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Again, we will demonstrate this by example. In the case of gestational days and birth weights, $r$ is about 0.4 and $\sqrt{1-r^2}$ is about 0.91:</p></div></div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">r</span> <span class="o">=</span> <span class="n">correlation</span><span class="p">(</span><span class="n">baby</span><span class="p">,</span> <span class="s1">'Gestational Days'</span><span class="p">,</span> <span class="s1">'Birth Weight'</span><span class="p">)</span>
<span class="n">np</span><span class="o">.</span><span class="n">sqrt</span><span class="p">(</span><span class="mi">1</span> <span class="o">-</span> <span class="n">r</span><span class="o">**</span><span class="mi">2</span><span class="p">)</span>
</pre></div></div></div>
<div class="output_text output_subarea output_execute_result">
<pre>0.91318611003278638</pre></div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">residuals</span> <span class="o">=</span> <span class="n">baby_regression</span><span class="o">.</span><span class="n">column</span><span class="p">(</span><span class="s1">'Residual'</span><span class="p">)</span>
<span class="n">observed_birth_weight</span> <span class="o">=</span> <span class="n">baby_regression</span><span class="o">.</span><span class="n">column</span><span class="p">(</span><span class="s1">'Birth Weight'</span><span class="p">)</span>
<span class="n">np</span><span class="o">.</span><span class="n">std</span><span class="p">(</span><span class="n">residuals</span><span class="p">)</span><span class="o">/</span><span class="n">np</span><span class="o">.</span><span class="n">std</span><span class="p">(</span><span class="n">observed_birth_weight</span><span class="p">)</span>
</pre></div></div></div>
<div class="output_text output_subarea output_execute_result">
<pre>0.9131861100327866</pre></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Thus the SD of the residuals is $\sqrt{1-r^2}$ times the SD of the observed birth weights.</p>
<p>It will come as no surprise that the same relation is true for the regression of fuel efficiency on acceleration:</p></div></div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">r</span> <span class="o">=</span> <span class="n">correlation</span><span class="p">(</span><span class="n">hybrid</span><span class="p">,</span> <span class="s1">'acceleration'</span><span class="p">,</span> <span class="s1">'mpg'</span><span class="p">)</span>
<span class="n">np</span><span class="o">.</span><span class="n">sqrt</span><span class="p">(</span><span class="mi">1</span> <span class="o">-</span> <span class="n">r</span><span class="o">**</span><span class="mi">2</span><span class="p">)</span>
</pre></div></div></div>
<div class="output_text output_subarea output_execute_result">
<pre>0.862492183185677</pre></div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">residuals</span> <span class="o">=</span> <span class="n">hybrid</span><span class="o">.</span><span class="n">column</span><span class="p">(</span><span class="s1">'mpg'</span><span class="p">)</span> <span class="o">-</span> <span class="n">fitted_mpg</span>
<span class="n">observed_mpg</span> <span class="o">=</span> <span class="n">hybrid</span><span class="o">.</span><span class="n">column</span><span class="p">(</span><span class="s1">'mpg'</span><span class="p">)</span>
<span class="n">np</span><span class="o">.</span><span class="n">std</span><span class="p">(</span><span class="n">residuals</span><span class="p">)</span><span class="o">/</span><span class="n">np</span><span class="o">.</span><span class="n">std</span><span class="p">(</span><span class="n">observed_mpg</span><span class="p">)</span>
</pre></div></div></div>
<div class="output_text output_subarea output_execute_result">
<pre>0.862492183185677</pre></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Let us examine the extreme cases of the general fact that the SD of the residuals is $\sqrt{1-r^2}$ times the SD of the observed values of $y$. Remember the residuals always average out to 0.</p>
<ul>
<li>If $r=1$ or $r=-1$, then $\sqrt{1-r^2} = 0$. Therefore the residuals have an average of 0 and an SD of 0 as well; therefore, the residuals are all equal to 0. This is consistent with the observation that if $r = \pm 1$, the scatter plot is a perfect straight line and there is no error in the regression estimate. </li>
<li>If $r=0$, the $\sqrt{1-r^2} =1$ and the SD of the regression estimate is equal to the SD of $y$. This is consistent with the observation that if $r=0$ then the regression line is a flat line at average of $y$, and the root mean square error of regression is the root mean squared deviation from the average of $y$, which is the SD of $y$.</li>
</ul>
<p>For every other value of $r$, the rough size of the error of the regression estimate is somewhere between 0 and the SD of $y$.</p></div></div></div>