
<div id="ipython-notebook">
            
<script type="text/x-mathjax-config">
  MathJax.Hub.Config({
    tex2jax: {
      inlineMath: [['$','$']],
      processEscapes: true
    }
  });
</script>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h2 id="Comparing-Two-Groups">Comparing Two Groups<a class="anchor-link" href="#Comparing-Two-Groups">¶</a></h2><p>In the examples above, we investigated whether a sample appears to be chosen randomly from an underlying population. We did this by comparing the distribution of the sample with the distribution of the population. A similar line of reasoning can be used to compare the distributions of two samples. In particular, we can investigate whether or not two samples appear to be drawn from the same underlying distribution.</p></div></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h3 id="Example:-Married-Couples-and-Unmarried-Partners">Example: Married Couples and Unmarried Partners<a class="anchor-link" href="#Example:-Married-Couples-and-Unmarried-Partners">¶</a></h3><p>Our next example is based on a study conducted in 2010 under the auspices of the National Center for Family and Marriage Research.</p>
<p>In the United States, the proportion of couples who live together but are not married has been rising in recent decades. The study involved a national random sample of over 1,000 heterosexual couples who were either married or "cohabiting partners" – living together but unmarried. One of the goals of the study was to compare the attitudes and experiences of the married and unmarried couples.</p>
<p>The table below shows a subset of the data collected in the study. Each row corresponds to one person. The variables that we will examine in this section are:</p>
<ul>
<li>Marital Status: married or unmarried</li>
<li>Employment Status: one of several categories described below</li>
<li>Gender</li>
<li>Age: Age in years</li>
</ul></div></div>
<div class="inner_cell">
    <div class="input_area">
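<div class=" highlight hl-ipython3"><pre><span></span><span class="kn">from</span> <span class="nn">datascience</span> <span class="kn">import</span> <span class="o">*</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="nn">plt</span>
<span class="n">columns</span> <span class="o">=</span> <span class="p">[</span><span class="s1">'Marital Status'</span><span class="p">,</span> <span class="s1">'Employment Status'</span><span class="p">,</span> <span class="s1">'Gender'</span><span class="p">,</span> <span class="s1">'Age'</span><span class="p">]</span>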
<span class="n">couples</span> <span class="o">=</span> <span class="n">Table</span><span class="o">.</span><span class="n">read_table</span><span class="p">(</span><span class="s1">'couples.csv'</span><span class="p">)</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">columns</span><span class="p">)</span>
<span class="n">couples</span>
</pre></div></div></div>
<div class="output_html rendered_html output_subarea output_execute_result">
<table border="1" class="dataframe">
    <thead>
        <tr>
            <th>Marital Status</th> <th>Employment Status</th> <th>Gender</th> <th>Age</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td>married       </td> <td>working as paid employee</td> <td>male  </td> <td>51  </td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>married       </td> <td>working as paid employee</td> <td>female</td> <td>53  </td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>married       </td> <td>working as paid employee</td> <td>male  </td> <td>57  </td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>married       </td> <td>working as paid employee</td> <td>female</td> <td>57  </td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>married       </td> <td>working as paid employee</td> <td>male  </td> <td>60  </td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>married       </td> <td>working as paid employee</td> <td>female</td> <td>57  </td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>married       </td> <td>working, self-employed  </td> <td>male  </td> <td>62  </td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>married       </td> <td>working as paid employee</td> <td>female</td> <td>59  </td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>married       </td> <td>not working - other     </td> <td>male  </td> <td>53  </td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>married       </td> <td>not working - retired   </td> <td>female</td> <td>61  </td>
        </tr>
    </tbody>
</table>
<p>... (2056 rows omitted)</p></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Let us consider just the males first. There are 742 married couples and 291 unmarried couples, and all couples in this study had one male and one female, making 1,033 males in all.</p></div></div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="c1"># Separate tables for married and cohabiting unmarried couples:</span>

<span class="n">married_men</span> <span class="o">=</span> <span class="n">couples</span><span class="o">.</span><span class="n">where</span><span class="p">(</span><span class="s1">'Gender'</span><span class="p">,</span> <span class="s1">'male'</span><span class="p">)</span><span class="o">.</span><span class="n">where</span><span class="p">(</span><span class="s1">'Marital Status'</span><span class="p">,</span> <span class="s1">'married'</span><span class="p">)</span>
<span class="n">partnered_men</span> <span class="o">=</span> <span class="n">couples</span><span class="o">.</span><span class="n">where</span><span class="p">(</span><span class="s1">'Gender'</span><span class="p">,</span> <span class="s1">'male'</span><span class="p">)</span><span class="o">.</span><span class="n">where</span><span class="p">(</span><span class="s1">'Marital Status'</span><span class="p">,</span> <span class="s1">'partner'</span><span class="p">)</span>

<span class="c1"># Let's see how many married and unmarried men there are:</span>
<span class="n">couples</span><span class="o">.</span><span class="n">where</span><span class="p">(</span><span class="s1">'Gender'</span><span class="p">,</span> <span class="s1">'male'</span><span class="p">)</span><span class="o">.</span><span class="n">group</span><span class="p">(</span><span class="s1">'Marital Status'</span><span class="p">)</span><span class="o">.</span><span class="n">barh</span><span class="p">(</span><span class="s2">"Marital Status"</span><span class="p">)</span>
</pre></div></div></div>
<div class="output_png output_subarea ">
<img src="../notebooks-images/Permutation_5_0.png"/></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Societal norms have changed over the decades, and there has been a gradual acceptance of couples living together without being married. Thus it is natural to expect that unmarried couples will in general consist of younger people than married couples.</p>
<p>The histograms of the ages of the married and unmarried men show that this is indeed the case. We will draw these histograms and compare them. In order to compare two histograms, both should be drawn to the same scale. Let us write a function that does this for us.</p></div></div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="k">def</span> <span class="nf">plot_age</span><span class="p">(</span><span class="n">table</span><span class="p">,</span> <span class="n">subject</span><span class="p">):</span>
    <span class="sd">"""</span>
<span class="sd">    Draws a histogram of the Age column in the given table.</span>
<span class="sd">    </span>
<span class="sd">    table should be a Table with a column of people's ages called Age.</span>
<span class="sd">    </span>
<span class="sd">    subject should be a string -- the name of the group we're displaying,</span>
<span class="sd">    like "married men".</span>
<span class="sd">    """</span>
    <span class="c1"># Draw a histogram of ages running from 15 years to 70 years</span>
    <span class="n">table</span><span class="o">.</span><span class="n">hist</span><span class="p">(</span><span class="s1">'Age'</span><span class="p">,</span> <span class="n">bins</span><span class="o">=</span><span class="n">np</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">15</span><span class="p">,</span> <span class="mi">71</span><span class="p">,</span> <span class="mi">5</span><span class="p">),</span> <span class="n">unit</span><span class="o">=</span><span class="s1">'year'</span><span class="p">)</span>
    <span class="c1"># Set the lower and upper bounds of the vertical axis so that</span>
    <span class="c1"># the plots we make are all comparable.</span>
    <span class="n">plt</span><span class="o">.</span><span class="n">ylim</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mf">0.045</span><span class="p">)</span>
    <span class="n">plt</span><span class="o">.</span><span class="n">title</span><span class="p">(</span><span class="s2">"Ages of "</span> <span class="o">+</span> <span class="n">subject</span><span class="p">)</span>
</pre></div></div></div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="c1"># Ages of men:</span>
<span class="n">plot_age</span><span class="p">(</span><span class="n">married_men</span><span class="p">,</span> <span class="s2">"married men"</span><span class="p">)</span>
<span class="n">plot_age</span><span class="p">(</span><span class="n">partnered_men</span><span class="p">,</span> <span class="s2">"cohabiting unmarried men"</span><span class="p">)</span>
</pre></div></div></div>
<div class="output_png output_subarea ">
<img src="../notebooks-images/Permutation_8_0.png"/></div>
<div class="output_png output_subarea ">
<img src="../notebooks-images/Permutation_8_1.png"/></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>The difference is even more marked when we compare the married and unmarried women.</p></div></div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">married_women</span> <span class="o">=</span> <span class="n">couples</span><span class="o">.</span><span class="n">where</span><span class="p">(</span><span class="s1">'Gender'</span><span class="p">,</span> <span class="s1">'female'</span><span class="p">)</span><span class="o">.</span><span class="n">where</span><span class="p">(</span><span class="s1">'Marital Status'</span><span class="p">,</span> <span class="s1">'married'</span><span class="p">)</span>
<span class="n">partnered_women</span> <span class="o">=</span> <span class="n">couples</span><span class="o">.</span><span class="n">where</span><span class="p">(</span><span class="s1">'Gender'</span><span class="p">,</span> <span class="s1">'female'</span><span class="p">)</span><span class="o">.</span><span class="n">where</span><span class="p">(</span><span class="s1">'Marital Status'</span><span class="p">,</span> <span class="s1">'partner'</span><span class="p">)</span>

<span class="c1"># Ages of women:</span>
<span class="n">plot_age</span><span class="p">(</span><span class="n">married_women</span><span class="p">,</span> <span class="s2">"married women"</span><span class="p">)</span>
<span class="n">plot_age</span><span class="p">(</span><span class="n">partnered_women</span><span class="p">,</span> <span class="s2">"cohabiting unmarried women"</span><span class="p">)</span>
</pre></div></div></div>
<div class="output_png output_subarea ">
<img src="../notebooks-images/Permutation_10_0.png"/></div>
<div class="output_png output_subarea ">
<img src="../notebooks-images/Permutation_10_1.png"/></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>The histograms show that the married men in the sample are in general older than unmarried cohabiting men. Married women are in general older than unmarried women. These observations are consistent with what we had predicted based on changing social norms.</p>
<p>If married couples are in general older, they might differ from unmarried couples in other ways as well. Let us compare the employment status of the married and unmarried men in the sample.</p>
<p>The table below shows the marital status and employment status of each man in the sample.</p></div></div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">males</span> <span class="o">=</span> <span class="n">couples</span><span class="o">.</span><span class="n">where</span><span class="p">(</span><span class="s1">'Gender'</span><span class="p">,</span> <span class="s1">'male'</span><span class="p">)</span><span class="o">.</span><span class="n">select</span><span class="p">([</span><span class="s1">'Marital Status'</span><span class="p">,</span> <span class="s1">'Employment Status'</span><span class="p">])</span>
<span class="n">males</span><span class="o">.</span><span class="n">show</span><span class="p">(</span><span class="mi">5</span><span class="p">)</span>
</pre></div></div></div>
<div class="output_html rendered_html output_subarea ">
<table border="1" class="dataframe">
    <thead>
        <tr>
            <th>Marital Status</th> <th>Employment Status</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td>married       </td> <td>working as paid employee</td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>married       </td> <td>working as paid employee</td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>married       </td> <td>working as paid employee</td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>married       </td> <td>working, self-employed  </td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>married       </td> <td>not working - other     </td>
        </tr>
    </tbody>
</table>
<p>... (1028 rows omitted)</p></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h3 id="Contingency-Tables">Contingency Tables<a class="anchor-link" href="#Contingency-Tables">¶</a></h3></div></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>To investigate the association between employment and marriage, we would like to be able to ask questions like, "How many married men are retired?"</p>
<p>Recall that the method <code>pivot</code> lets us do exactly that. It <em>cross-classifies</em> each man according to the two variables – marital status and employment status. Its output is a <em>contingency table</em> that contains the counts in each pair of categories.</p></div></div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">employment</span> <span class="o">=</span> <span class="n">males</span><span class="o">.</span><span class="n">pivot</span><span class="p">(</span><span class="s1">'Marital Status'</span><span class="p">,</span> <span class="s1">'Employment Status'</span><span class="p">)</span>
<span class="n">employment</span>
</pre></div></div></div>
<div class="output_html rendered_html output_subarea output_execute_result">
<table border="1" class="dataframe">
    <thead>
        <tr>
            <th>Employment Status</th> <th>married</th> <th>partner</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td>not working - disabled                        </td> <td>44     </td> <td>20     </td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>not working - looking for work                </td> <td>28     </td> <td>33     </td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>not working - on a temporary layoff from a job</td> <td>15     </td> <td>8      </td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>not working - other                           </td> <td>16     </td> <td>9      </td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>not working - retired                         </td> <td>44     </td> <td>4      </td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>working as paid employee                      </td> <td>513    </td> <td>170    </td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>working, self-employed                        </td> <td>82     </td> <td>47     </td>
        </tr>
    </tbody>
</table></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>The arguments of <code>pivot</code> are the labels of the two columns corresponding to the variables we are studying. Categories of the first argument appear as columns; categories of the second argument are the rows.  Each cell of the table contains the number of men in a pair of categories – a particular employment status and a particular marital status.</p></div></div>
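<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Cross-classification amounts to counting how often each pair of categories occurs. The following is a minimal plain-Python sketch of that idea, not the <code>datascience</code> implementation of <code>pivot</code>. The helper name <code>cross_classify</code> and the toy records are made up for illustration and are not part of the study's data.</p></div></div>

```python
from collections import Counter

def cross_classify(records):
    """Count how many records fall in each (employment, marital) pair.

    records is a list of (marital_status, employment_status) tuples.
    Returns a dict mapping employment status to a dict of counts
    keyed by marital status, mirroring the rows and columns of a
    contingency table.
    """
    pair_counts = Counter(records)
    marital_levels = sorted({marital for marital, _ in records})
    employment_levels = sorted({employment for _, employment in records})
    return {
        employment: {marital: pair_counts[(marital, employment)]
                     for marital in marital_levels}
        for employment in employment_levels
    }

# Made-up toy data, for illustration only (not the study's data)
toy_records = [
    ('married', 'working as paid employee'),
    ('married', 'not working - retired'),
    ('partner', 'working as paid employee'),
    ('married', 'working as paid employee'),
    ('partner', 'not working - looking for work'),
]

toy_table = cross_classify(toy_records)
for employment, counts in toy_table.items():
    print(employment, counts)
```

<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Pairs that never occur get a count of 0, because a <code>Counter</code> returns 0 for a missing key.</p></div></div>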
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>The table shows that regardless of marital status, the men in the sample are most likely to be working as paid employees. But it is quite hard to compare the entire distributions based on this table, because the numbers of married and unmarried men in the sample are not the same. There are 742 married men but only 291 unmarried ones.</p></div></div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">employment</span><span class="o">.</span><span class="n">drop</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span><span class="o">.</span><span class="n">sum</span><span class="p">()</span>
</pre></div></div></div>
<div class="output_html rendered_html output_subarea output_execute_result">
<table border="1" class="dataframe">
    <thead>
        <tr>
            <th>married</th> <th>partner</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td>742    </td> <td>291    </td>
        </tr>
    </tbody>
</table></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>To adjust for this difference in total numbers, we will convert the counts into proportions, by dividing all the <code>married</code> counts by 742 and all the <code>partner</code> counts by 291.</p></div></div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">proportions</span> <span class="o">=</span> <span class="n">Table</span><span class="p">()</span><span class="o">.</span><span class="n">with_columns</span><span class="p">([</span>
        <span class="s2">"Employment Status"</span><span class="p">,</span> <span class="n">employment</span><span class="o">.</span><span class="n">column</span><span class="p">(</span><span class="s2">"Employment Status"</span><span class="p">),</span>
        <span class="s2">"married"</span><span class="p">,</span> <span class="n">employment</span><span class="o">.</span><span class="n">column</span><span class="p">(</span><span class="s1">'married'</span><span class="p">)</span><span class="o">/</span><span class="nb">sum</span><span class="p">(</span><span class="n">employment</span><span class="o">.</span><span class="n">column</span><span class="p">(</span><span class="s1">'married'</span><span class="p">)),</span>
        <span class="s2">"partner"</span><span class="p">,</span> <span class="n">employment</span><span class="o">.</span><span class="n">column</span><span class="p">(</span><span class="s1">'partner'</span><span class="p">)</span><span class="o">/</span><span class="nb">sum</span><span class="p">(</span><span class="n">employment</span><span class="o">.</span><span class="n">column</span><span class="p">(</span><span class="s1">'partner'</span><span class="p">))</span>
    <span class="p">])</span>
<span class="n">proportions</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
</pre></div></div></div>
<div class="output_html rendered_html output_subarea ">
<table border="1" class="dataframe">
    <thead>
        <tr>
            <th>Employment Status</th> <th>married</th> <th>partner</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td>not working - disabled                        </td> <td>0.0592992</td> <td>0.0687285</td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>not working - looking for work                </td> <td>0.0377358</td> <td>0.113402 </td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>not working - on a temporary layoff from a job</td> <td>0.0202156</td> <td>0.0274914</td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>not working - other                           </td> <td>0.0215633</td> <td>0.0309278</td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>not working - retired                         </td> <td>0.0592992</td> <td>0.0137457</td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>working as paid employee                      </td> <td>0.691375 </td> <td>0.584192 </td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>working, self-employed                        </td> <td>0.110512 </td> <td>0.161512 </td>
        </tr>
    </tbody>
</table></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>The <code>married</code> column of this table shows the distribution of employment status of the married men in the sample. For example, among married men, the proportion who are retired is about 0.059. The <code>partner</code> column shows the distribution of the employment status of the unmarried men in the sample. Among unmarried men, the proportion who are retired is about 0.014.</p>
<p>The two distributions look different from each other in other ways too, as can be seen more clearly in the bar graphs below. It appears that a larger proportion of the married men in the sample work as paid employees, whereas a larger proportion of the unmarried men are not working but are looking for work.</p></div></div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">proportions</span><span class="o">.</span><span class="n">barh</span><span class="p">(</span><span class="s1">'Employment Status'</span><span class="p">)</span>
</pre></div></div></div>
<div class="output_png output_subarea ">
<img src="../notebooks-images/Permutation_22_0.png"/></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>The distributions of employment status of the men in the two groups – married and unmarried – are clearly different in the sample.</p></div></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h3 id="Are-the-two-distributions-different-in-the-population?">Are the two distributions different in the population?<a class="anchor-link" href="#Are-the-two-distributions-different-in-the-population?">¶</a></h3><p>This raises the question of whether the difference is due to randomness in the sampling, or whether the distributions of employment status are indeed different for married and unmarried cohabiting men in the U.S. Remember that the data that we have are from a sample of just 1,033 couples; we do not know the distribution of employment status of married or unmarried cohabiting men in the entire country.</p></div></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>We can answer the question by performing a statistical test of hypotheses. Let us use the terminology that we developed for this in the previous section.</p></div></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p><strong>Null hypothesis.</strong> In the United States, the distribution of employment status among married men is the same as among unmarried men who live with their partners.</p>
<p>Another way of saying this is that employment status and marital status are <em>independent</em> or <em>not associated</em>.</p>
<p>If the null hypothesis were true, then the difference that we have observed in the sample would be just due to chance.</p>
<p><strong>Alternative hypothesis.</strong> In the United States, the distributions of the employment status of the two groups of men are different. In other words, employment status and marital status are associated in some way.</p></div></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>As our <strong>test statistic</strong>, we will use the total variation distance between the two distributions of employment status: one for the married men and one for the unmarried men.</p>
<p>The observed value of the test statistic is about 0.15:</p></div></div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="c1"># TVD between the two distributions in the sample</span>
<span class="n">married</span> <span class="o">=</span> <span class="n">proportions</span><span class="o">.</span><span class="n">column</span><span class="p">(</span><span class="s1">'married'</span><span class="p">)</span>
<span class="n">partner</span> <span class="o">=</span> <span class="n">proportions</span><span class="o">.</span><span class="n">column</span><span class="p">(</span><span class="s1">'partner'</span><span class="p">)</span>
<span class="n">observed_tvd</span> <span class="o">=</span> <span class="mf">0.5</span><span class="o">*</span><span class="nb">sum</span><span class="p">(</span><span class="nb">abs</span><span class="p">(</span><span class="n">married</span> <span class="o">-</span> <span class="n">partner</span><span class="p">))</span>
<span class="n">observed_tvd</span>
</pre></div></div></div>
<div class="output_text output_subarea output_execute_result">
<pre>0.15273571011754242</pre></div>
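<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>As a cross-check that does not depend on the <code>Table</code> methods, the same statistic can be computed in plain Python directly from the counts in the contingency table. This is only a sketch: the helper names <code>to_proportions</code> and <code>total_variation_distance</code> are chosen here for illustration, and the total variation distance is taken to be half the sum of the absolute differences between the two distributions, as in the cell above.</p></div></div>

```python
# Counts of men by employment status, copied from the contingency table
# (rows in the same order as the table)
married_counts = [44, 28, 15, 16, 44, 513, 82]
partner_counts = [20, 33, 8, 9, 4, 170, 47]

def to_proportions(counts):
    """Convert a list of counts into proportions that sum to 1."""
    total = sum(counts)
    return [count / total for count in counts]

def total_variation_distance(dist_1, dist_2):
    """Half the sum of absolute differences between two distributions."""
    return 0.5 * sum(abs(p - q) for p, q in zip(dist_1, dist_2))

tvd = total_variation_distance(to_proportions(married_counts),
                               to_proportions(partner_counts))
print(round(tvd, 4))
```

<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>The printed value should agree, after rounding, with the observed statistic shown above.</p></div></div>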
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h3 id="Random-Permutations">Random Permutations<a class="anchor-link" href="#Random-Permutations">¶</a></h3><p>In order to compare this observed value of the total variation distance with what is predicted by the null hypothesis, we need to know how the total variation distance would vary across all possible random samples if employment status and marital status were not related.</p>
<p>This is quite daunting to derive by mathematics, but let us see if we can get a good approximation by simulation.</p>
<p>With just one sample at hand, and no further knowledge of the distribution of employment status among men in the United States, how can we go about replicating the sampling procedure? The key is to note that <em>if</em> marital status and employment status were not connected in any way, then we could replicate the sampling process by replacing each man's employment status by a randomly picked employment status from among all the men, married and unmarried.</p>
<p>Doing this for all the men is equivalent to randomly rearranging the entire column containing employment status, while leaving the marital status column unchanged. Such a rearrangement is called a <em>random permutation</em>.</p></div></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Thus, under the null hypothesis, we can replicate the sampling process by assigning to each man an employment status chosen at random without replacement from the entries in the column <code>Employment Status</code>. We can do the replication by simply permuting the entire <code>Employment Status</code> column and leaving everything else unchanged.</p></div></div>
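<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>To see what a random permutation changes and what it leaves alone, here is a minimal plain-Python sketch on made-up data. It uses the standard library's <code>random.sample</code> rather than the <code>Table</code> methods, and the seed is an arbitrary choice made only so that the run is repeatable.</p></div></div>

```python
import random
from collections import Counter

# Made-up toy data, for illustration only (not the study's data)
marital = ['married', 'married', 'married', 'partner', 'partner']
employment = ['employee', 'retired', 'employee', 'looking', 'employee']

rng = random.Random(0)  # fixed seed (arbitrary choice) so the run is repeatable

# random.sample with k equal to the list length draws every entry once,
# without replacement: a random permutation of the column
shuffled_employment = rng.sample(employment, k=len(employment))

# The marital status column is untouched, and the shuffled column
# contains exactly the same entries as before, only in a new order
print(list(zip(marital, shuffled_employment)))
print(Counter(shuffled_employment) == Counter(employment))
```

<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>The group sizes and the overall distribution of employment status stay fixed; only the pairing between the two columns is randomized.</p></div></div>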
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Let's implement this plan. First, we will shuffle the column <code>Employment Status</code> using the <code>sample</code> method, which just shuffles all the rows when provided with no arguments.</p></div></div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="c1"># Randomly permute the employment status of all men</span>

<span class="n">shuffled</span> <span class="o">=</span> <span class="n">males</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s1">'Employment Status'</span><span class="p">)</span><span class="o">.</span><span class="n">sample</span><span class="p">()</span>
<span class="n">shuffled</span>
</pre></div></div></div>
<div class="output_html rendered_html output_subarea output_execute_result">
<table border="1" class="dataframe">
    <thead>
        <tr>
            <th>Employment Status</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td>working, self-employed  </td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>working, self-employed  </td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>working as paid employee</td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>working as paid employee</td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>working, self-employed  </td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>working as paid employee</td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>not working - disabled  </td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>working as paid employee</td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>working as paid employee</td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>working as paid employee</td>
        </tr>
    </tbody>
</table>
<p>... (1023 rows omitted)</p></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>The first two columns of the table below are taken from the original sample. The third has been created by randomly permuting the original <code>Employment Status</code> column.</p></div></div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="c1"># Construct a table in which employment status has been shuffled</span>

<span class="n">males_with_shuffled_empl</span> <span class="o">=</span> <span class="n">Table</span><span class="p">()</span><span class="o">.</span><span class="n">with_columns</span><span class="p">([</span>
        <span class="s2">"Marital Status"</span><span class="p">,</span> <span class="n">males</span><span class="o">.</span><span class="n">column</span><span class="p">(</span><span class="s1">'Marital Status'</span><span class="p">),</span>
        <span class="s2">"Employment Status"</span><span class="p">,</span> <span class="n">males</span><span class="o">.</span><span class="n">column</span><span class="p">(</span><span class="s1">'Employment Status'</span><span class="p">),</span>
        <span class="s2">"Employment Status (shuffled)"</span><span class="p">,</span> <span class="n">shuffled</span><span class="o">.</span><span class="n">column</span><span class="p">(</span><span class="s1">'Employment Status'</span><span class="p">)</span>
    <span class="p">])</span>
<span class="n">males_with_shuffled_empl</span>
</pre></div></div></div>
<div class="output_html rendered_html output_subarea output_execute_result">
<table border="1" class="dataframe">
    <thead>
        <tr>
            <th>Marital Status</th> <th>Employment Status</th> <th>Employment Status (shuffled)</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td>married       </td> <td>working as paid employee                      </td> <td>working, self-employed      </td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>married       </td> <td>working as paid employee                      </td> <td>working, self-employed      </td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>married       </td> <td>working as paid employee                      </td> <td>working as paid employee    </td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>married       </td> <td>working, self-employed                        </td> <td>working as paid employee    </td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>married       </td> <td>not working - other                           </td> <td>working, self-employed      </td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>married       </td> <td>not working - on a temporary layoff from a job</td> <td>working as paid employee    </td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>married       </td> <td>not working - disabled                        </td> <td>not working - disabled      </td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>married       </td> <td>working as paid employee                      </td> <td>working as paid employee    </td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>married       </td> <td>working as paid employee                      </td> <td>working as paid employee    </td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>married       </td> <td>not working - retired                         </td> <td>working as paid employee    </td>
        </tr>
    </tbody>
</table>
<p>... (1023 rows omitted)</p></div>
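<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>To see what a random permutation does to a column, here is a minimal sketch that uses only the Python standard library rather than the <code>Table</code> methods used in this section. The two short lists are hypothetical stand-ins for the study's columns, and the names <code>marital</code>, <code>employment</code>, and <code>shuffled_employment</code> are chosen purely for illustration.</p>

```python
import random
from collections import Counter

# Hypothetical miniature data, not taken from the study
marital = ['married', 'married', 'married', 'partner', 'partner']
employment = ['employee', 'self-employed', 'employee', 'retired', 'employee']

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Sampling k = len(list) items without replacement draws every item
# exactly once, in random order: a random permutation of the column
shuffled_employment = rng.sample(employment, k=len(employment))

# The shuffled column holds the same values with the same counts
same_counts = Counter(shuffled_employment) == Counter(employment)
print(same_counts)

# Only the pairing with the marital status column may have changed
for status, original, shuffled in zip(marital, employment, shuffled_employment):
    print(status, '|', original, '|', shuffled)
```

<p>Shuffling changes which employment status sits next to which marital status, but it leaves the overall count of each employment status untouched.</p></div></div>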
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Once again, the <code>pivot</code> method computes the contingency table, which allows us to calculate the total variation distance between the distributions of the two groups of men after their employment status has been shuffled.</p></div></div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">employment_shuffled</span> <span class="o">=</span> <span class="n">males_with_shuffled_empl</span><span class="o">.</span><span class="n">pivot</span><span class="p">(</span><span class="s1">'Marital Status'</span><span class="p">,</span> <span class="s1">'Employment Status (shuffled)'</span><span class="p">)</span>
<span class="n">employment_shuffled</span>
</pre></div></div></div>
<div class="output_html rendered_html output_subarea output_execute_result">
<table border="1" class="dataframe">
    <thead>
        <tr>
            <th>Employment Status (shuffled)</th> <th>married</th> <th>partner</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td>not working - disabled                        </td> <td>48     </td> <td>16     </td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>not working - looking for work                </td> <td>44     </td> <td>17     </td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>not working - on a temporary layoff from a job</td> <td>16     </td> <td>7      </td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>not working - other                           </td> <td>18     </td> <td>7      </td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>not working - retired                         </td> <td>39     </td> <td>9      </td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>working as paid employee                      </td> <td>489    </td> <td>194    </td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>working, self-employed                        </td> <td>88     </td> <td>41     </td>
        </tr>
    </tbody>
</table></div>
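<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>A contingency table simply counts how many rows fall into each combination of the two variables. The sketch below illustrates that idea with the Python standard library; it is not how <code>pivot</code> is implemented, and the short lists and the name <code>pair_counts</code> are hypothetical.</p>

```python
from collections import Counter

# Hypothetical miniature data, not taken from the study
marital = ['married', 'married', 'partner', 'married', 'partner']
employment = ['employee', 'retired', 'employee', 'employee', 'employee']

# Count each (employment status, marital status) pair once per person
pair_counts = Counter(zip(employment, marital))

# One row per employment status, one count per marital status
for status in sorted(set(employment)):
    row = [pair_counts[(status, m)] for m in ('married', 'partner')]
    print(status, row)
```

<p>A combination that never occurs, such as a retired partner in this toy data, gets a count of 0 because a <code>Counter</code> returns 0 for missing keys.</p></div></div>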
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="c1"># TVD between the two distributions in the contingency table above</span>
<span class="n">e_s</span> <span class="o">=</span> <span class="n">employment_shuffled</span>
<span class="n">married</span> <span class="o">=</span> <span class="n">e_s</span><span class="o">.</span><span class="n">column</span><span class="p">(</span><span class="s1">'married'</span><span class="p">)</span><span class="o">/</span><span class="nb">sum</span><span class="p">(</span><span class="n">e_s</span><span class="o">.</span><span class="n">column</span><span class="p">(</span><span class="s1">'married'</span><span class="p">))</span>
<span class="n">partner</span> <span class="o">=</span> <span class="n">e_s</span><span class="o">.</span><span class="n">column</span><span class="p">(</span><span class="s1">'partner'</span><span class="p">)</span><span class="o">/</span><span class="nb">sum</span><span class="p">(</span><span class="n">e_s</span><span class="o">.</span><span class="n">column</span><span class="p">(</span><span class="s1">'partner'</span><span class="p">))</span>
<span class="mf">0.5</span><span class="o">*</span><span class="nb">sum</span><span class="p">(</span><span class="nb">abs</span><span class="p">(</span><span class="n">married</span> <span class="o">-</span> <span class="n">partner</span><span class="p">))</span>
</pre></div></div></div>
<div class="output_text output_subarea output_execute_result">
<pre>0.032423745611841297</pre></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>This total variation distance was computed from data shuffled under the null hypothesis that the distributions of employment status for the two groups of men are the same. It is noticeably smaller than the observed total variation distance (0.15) between the two groups in our original sample.</p></div></div>
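<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>To put values such as 0.03 and 0.15 in context, recall that the total variation distance lies between 0 (identical distributions) and 1 (distributions with no category in common). The sketch below recomputes the shuffled TVD from the counts in the contingency table above and then checks the two extremes. The function name <code>total_variation_distance</code> is an assumption made for this illustration and is not defined elsewhere in the text.</p>

```python
def total_variation_distance(counts_a, counts_b):
    """Half the sum of absolute differences between two lists of proportions."""
    total_a = sum(counts_a)
    total_b = sum(counts_b)
    return 0.5 * sum(abs(a / total_a - b / total_b)
                     for a, b in zip(counts_a, counts_b))

# Counts copied from the shuffled contingency table above
married_counts = [48, 44, 16, 18, 39, 489, 88]
partner_counts = [16, 17, 7, 7, 9, 194, 41]
shuffled_tvd = total_variation_distance(married_counts, partner_counts)
print(round(shuffled_tvd, 4))  # 0.0324

# Two made-up extreme cases that mark the ends of the scale
identical = total_variation_distance([10, 20, 30], [1, 2, 3])  # same proportions
disjoint = total_variation_distance([5, 0], [0, 5])            # no overlap
print(identical, disjoint)
```

<p>Because the function works on proportions, the two groups do not need to be the same size.</p></div></div>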
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h3 id="A-Permutation-Test">A Permutation Test<a class="anchor-link" href="#A-Permutation-Test">¶</a></h3><p>Could this just be due to chance variation? We will only know if we run many more replications, by randomly permuting the <code>Employment Status</code> column repeatedly. This method of testing is known as a <strong>permutation test</strong>.</p></div></div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="c1"># Put it all together in a for loop to perform a permutation test</span>

<span class="n">repetitions</span> <span class="o">=</span> <span class="mi">500</span>

<span class="n">tvds</span> <span class="o">=</span> <span class="n">Table</span><span class="p">()</span><span class="o">.</span><span class="n">with_column</span><span class="p">(</span><span class="s2">"TVD between married and partnered men"</span><span class="p">,</span> <span class="p">[])</span>

<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">np</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="n">repetitions</span><span class="p">):</span>
    <span class="c1"># Construct a permuted table: sampling every row without replacement shuffles the column</span>
    <span class="n">shuffled</span> <span class="o">=</span> <span class="n">males</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s1">'Employment Status'</span><span class="p">)</span><span class="o">.</span><span class="n">sample</span><span class="p">(</span><span class="n">with_replacement</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span>
    <span class="n">combined</span> <span class="o">=</span> <span class="n">Table</span><span class="p">()</span><span class="o">.</span><span class="n">with_columns</span><span class="p">([</span>
            <span class="s2">"Marital Status"</span><span class="p">,</span> <span class="n">males</span><span class="o">.</span><span class="n">column</span><span class="p">(</span><span class="s1">'Marital Status'</span><span class="p">),</span>
            <span class="s2">"Employment Status"</span><span class="p">,</span> <span class="n">shuffled</span><span class="o">.</span><span class="n">column</span><span class="p">(</span><span class="s1">'Employment Status'</span><span class="p">)</span>
        <span class="p">])</span>
    <span class="n">employment_shuffled</span> <span class="o">=</span> <span class="n">combined</span><span class="o">.</span><span class="n">pivot</span><span class="p">(</span><span class="s1">'Marital Status'</span><span class="p">,</span> <span class="s1">'Employment Status'</span><span class="p">)</span>
    
    <span class="c1"># Compute TVD</span>
    <span class="n">e_s</span> <span class="o">=</span> <span class="n">employment_shuffled</span>
    <span class="n">married</span> <span class="o">=</span> <span class="n">e_s</span><span class="o">.</span><span class="n">column</span><span class="p">(</span><span class="s1">'married'</span><span class="p">)</span><span class="o">/</span><span class="nb">sum</span><span class="p">(</span><span class="n">e_s</span><span class="o">.</span><span class="n">column</span><span class="p">(</span><span class="s1">'married'</span><span class="p">))</span>
    <span class="n">partner</span> <span class="o">=</span> <span class="n">e_s</span><span class="o">.</span><span class="n">column</span><span class="p">(</span><span class="s1">'partner'</span><span class="p">)</span><span class="o">/</span><span class="nb">sum</span><span class="p">(</span><span class="n">e_s</span><span class="o">.</span><span class="n">column</span><span class="p">(</span><span class="s1">'partner'</span><span class="p">))</span>
    <span class="n">permutation_tvd</span> <span class="o">=</span> <span class="mf">0.5</span><span class="o">*</span><span class="nb">sum</span><span class="p">(</span><span class="nb">abs</span><span class="p">(</span><span class="n">married</span> <span class="o">-</span> <span class="n">partner</span><span class="p">))</span>
    <span class="n">tvds</span><span class="o">.</span><span class="n">append</span><span class="p">([</span><span class="n">permutation_tvd</span><span class="p">])</span>

<span class="n">tvds</span><span class="o">.</span><span class="n">hist</span><span class="p">(</span><span class="n">bins</span><span class="o">=</span><span class="n">np</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mf">0.2</span><span class="p">,</span> <span class="mf">0.01</span><span class="p">))</span>
</pre></div></div></div>
<div class="output_png output_subarea ">
<img src="../notebooks-images/Permutation_40_0.png"/></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>The figure above is the <strong>empirical distribution of the total variation distance</strong> between the distributions of the employment status of married and unmarried men, under the null hypothesis.</p>
<p><strong>The observed test statistic of 0.15 is quite far in the tail, and so the chance of observing such an extreme value under the null hypothesis is close to 0</strong>.</p>
<p>As before, this chance is called an empirical P-value. The P-value is the chance that our test statistic (TVD) would come out at least as extreme as the observed value (in this case 0.15 or greater) under the null hypothesis.</p></div></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h3 id="Conclusion-of-the-test">Conclusion of the test<a class="anchor-link" href="#Conclusion-of-the-test">¶</a></h3><p>Our empirical estimate, based on repeated random permutations, gives us the information we need to draw a conclusion from the data: the observed statistic is very unlikely under the null hypothesis.</p>
<p>The low P-value constitutes <strong>evidence in favor of the alternative hypothesis</strong>. The data support the hypothesis that in the United States, the distribution of the employment status of married men is not the same as that of unmarried men who live with their partners.</p>
<p>We have just completed our first <em>permutation test</em>. Permutation tests are quite common in practice because they make very few assumptions about the underlying population and are straightforward to perform and interpret.</p></div></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p><strong>Note about the approximate P-value</strong></p>
<p>Our simulation gives us only an approximate P-value, because it is based on just 500 random permutations instead of all possible permutations. We can compute this empirical P-value directly, without drawing the histogram:</p></div></div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">empirical_p_value</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">count_nonzero</span><span class="p">(</span><span class="n">tvds</span><span class="o">.</span><span class="n">column</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span> <span class="o">&gt;=</span> <span class="n">observed_tvd</span><span class="p">)</span> <span class="o">/</span> <span class="n">tvds</span><span class="o">.</span><span class="n">num_rows</span>
<span class="n">empirical_p_value</span>
</pre></div></div></div>
<div class="output_text output_subarea output_execute_result">
<pre>0.0</pre></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Computing the exact P-value would require us to consider every possible outcome of shuffling (and the number of such outcomes is very large) instead of 500 random shuffles. If we had examined all of them, at least one would have produced a TVD as large as the observed value: the arrangement that matches the original sample. So the true P-value is greater than zero, but not by much.</p></div></div>
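<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Because an empirical P-value of exactly 0 can be misread as "impossible," some analysts use a slightly different convention, which is not the one used in this section: they count the observed arrangement as one of the permutations and report (count + 1) / (repetitions + 1). The sketch below shows both versions side by side using only the standard library. The function name <code>empirical_p_value_from</code>, its <code>count_observed</code> parameter, and the simulated values are all hypothetical.</p>

```python
def empirical_p_value_from(simulated, observed, count_observed=False):
    """Proportion of simulated statistics at least as large as the observed one.

    With count_observed=True, the observed arrangement is treated as one
    extra permutation, so the result can never be exactly zero.
    """
    at_least = sum(1 for s in simulated if s >= observed)
    if count_observed:
        return (at_least + 1) / (len(simulated) + 1)
    return at_least / len(simulated)

# Made-up simulated TVDs, all well below a made-up observed value
simulated_tvds = [0.02, 0.03, 0.05, 0.01]
observed_value = 0.15

plain = empirical_p_value_from(simulated_tvds, observed_value)
adjusted = empirical_p_value_from(simulated_tvds, observed_value, count_observed=True)
print(plain, adjusted)
```

<p>With many repetitions the two versions are nearly identical, so the choice rarely changes the conclusion of a test.</p></div></div>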
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h3 id="Generalizing-Our-Hypothesis-Test">Generalizing Our Hypothesis Test<a class="anchor-link" href="#Generalizing-Our-Hypothesis-Test">¶</a></h3><p>The example above includes a substantial amount of code in order to investigate the relationship between two characteristics (marital status and employment status) for a particular subset of the surveyed population (males). Suppose we would like to investigate different characteristics or a different population. How can we reuse the code we have written so far in order to explore more relationships?</p>
<p>When you are about to copy your code, you should think, "Maybe I should write some functions."</p></div></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>What functions to write? A good way to make this decision is to think about what you have to compute repeatedly.</p>
<p>In our example, the total variation distance is computed over and over again. So we will begin with a generalized computation of total variation distance between the distribution of any column of values (such as employment status) when separated into any two conditions (such as marital status) for a collection of data described by any table. Our implementation includes the same statements as we used above, but uses generic names that are specified by the final function call.</p></div></div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="c1"># TVD between the distributions of values under any two conditions</span>

<span class="k">def</span> <span class="nf">tvd</span><span class="p">(</span><span class="n">t</span><span class="p">,</span> <span class="n">conditions</span><span class="p">,</span> <span class="n">values</span><span class="p">):</span>
    <span class="sd">"""Compute the total variation distance </span>
<span class="sd">    between proportions of values under two conditions.</span>
<span class="sd">    </span>
<span class="sd">    t          (Table) -- a table</span>
<span class="sd">    conditions (str)   -- a column label in t; should have only two categories</span>
<span class="sd">    values     (str)   -- a column label in t</span>
<span class="sd">    """</span>
    <span class="n">e</span> <span class="o">=</span> <span class="n">t</span><span class="o">.</span><span class="n">pivot</span><span class="p">(</span><span class="n">conditions</span><span class="p">,</span> <span class="n">values</span><span class="p">)</span>
    <span class="n">a</span> <span class="o">=</span> <span class="n">e</span><span class="o">.</span><span class="n">column</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>
    <span class="n">b</span> <span class="o">=</span> <span class="n">e</span><span class="o">.</span><span class="n">column</span><span class="p">(</span><span class="mi">2</span><span class="p">)</span>
    <span class="k">return</span> <span class="mf">0.5</span><span class="o">*</span><span class="nb">sum</span><span class="p">(</span><span class="nb">abs</span><span class="p">(</span><span class="n">a</span><span class="o">/</span><span class="nb">sum</span><span class="p">(</span><span class="n">a</span><span class="p">)</span> <span class="o">-</span> <span class="n">b</span><span class="o">/</span><span class="nb">sum</span><span class="p">(</span><span class="n">b</span><span class="p">)))</span>

<span class="n">tvd</span><span class="p">(</span><span class="n">males</span><span class="p">,</span> <span class="s1">'Marital Status'</span><span class="p">,</span> <span class="s1">'Employment Status'</span><span class="p">)</span>
</pre></div></div></div>
<div class="output_text output_subarea output_execute_result">
<pre>0.15273571011754242</pre></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Next, we can write a function that performs a permutation test using this <code>tvd</code> function to compute the same statistic on shuffled variants of any table. It's worth reading through this implementation to understand its details.</p></div></div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="k">def</span> <span class="nf">permutation_tvd</span><span class="p">(</span><span class="n">original</span><span class="p">,</span> <span class="n">conditions</span><span class="p">,</span> <span class="n">values</span><span class="p">):</span>
    <span class="sd">"""</span>
<span class="sd">    Perform a permutation test of whether </span>
<span class="sd">    the distribution of values for two conditions </span>
<span class="sd">    is the same in the population,</span>
<span class="sd">    using the total variation distance between two distributions</span>
<span class="sd">    as the test statistic.</span>
<span class="sd">    </span>
<span class="sd">    original is a Table with at least two columns.  The value of the</span>
<span class="sd">    argument conditions is the label of one column, and the value of the</span>
<span class="sd">    argument values is the label of another.  The conditions column</span>
<span class="sd">    should have only 2 possible values corresponding to 2 categories in</span>
<span class="sd">    the data.</span>
<span class="sd">    </span>
<span class="sd">    The values column is shuffled many times, and the data are grouped</span>
<span class="sd">    according to the conditions column.  The total variation distance</span>
<span class="sd">    between the proportions of values in the 2 categories is computed each time.</span>
<span class="sd">    </span>
<span class="sd">    Then we draw a histogram of all those TV distances.  This shows us </span>
<span class="sd">    what the TVD between the values of the two distributions would typically</span>
<span class="sd">    look like if the values were independent of the conditions.</span>
<span class="sd">    """</span>
    <span class="n">repetitions</span> <span class="o">=</span> <span class="mi">500</span>
    <span class="n">stats</span> <span class="o">=</span> <span class="p">[]</span>

    <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">np</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="n">repetitions</span><span class="p">):</span>
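        <span class="c1"># Sample every row without replacement so the result is a permutation</span>
        <span class="n">shuffled</span> <span class="o">=</span> <span class="n">original</span><span class="o">.</span><span class="n">sample</span><span class="p">(</span><span class="n">with_replacement</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span>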
        <span class="n">combined</span> <span class="o">=</span> <span class="n">Table</span><span class="p">()</span><span class="o">.</span><span class="n">with_columns</span><span class="p">([</span>
                <span class="n">conditions</span><span class="p">,</span> <span class="n">original</span><span class="o">.</span><span class="n">column</span><span class="p">(</span><span class="n">conditions</span><span class="p">),</span>
                <span class="n">values</span><span class="p">,</span> <span class="n">shuffled</span><span class="o">.</span><span class="n">column</span><span class="p">(</span><span class="n">values</span><span class="p">)</span>
            <span class="p">])</span>
        <span class="n">stats</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">tvd</span><span class="p">(</span><span class="n">combined</span><span class="p">,</span> <span class="n">conditions</span><span class="p">,</span> <span class="n">values</span><span class="p">))</span>

    <span class="n">observation</span> <span class="o">=</span> <span class="n">tvd</span><span class="p">(</span><span class="n">original</span><span class="p">,</span> <span class="n">conditions</span><span class="p">,</span> <span class="n">values</span><span class="p">)</span>
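    <span class="c1"># Convert the list to an array so the comparison is made element by element</span>
    <span class="n">p_value</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">count_nonzero</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="n">stats</span><span class="p">)</span> <span class="o">&gt;=</span> <span class="n">observation</span><span class="p">)</span> <span class="o">/</span> <span class="n">repetitions</span>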
    
    <span class="nb">print</span><span class="p">(</span><span class="s2">"Observation:"</span><span class="p">,</span> <span class="n">observation</span><span class="p">)</span>
    <span class="nb">print</span><span class="p">(</span><span class="s2">"Empirical P-value:"</span><span class="p">,</span> <span class="n">p_value</span><span class="p">)</span>
    <span class="n">Table</span><span class="p">()</span><span class="o">.</span><span class="n">with_column</span><span class="p">(</span><span class="s1">'Empirical distribution'</span><span class="p">,</span> <span class="n">stats</span><span class="p">)</span><span class="o">.</span><span class="n">hist</span><span class="p">()</span>
</pre></div></div></div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">permutation_tvd</span><span class="p">(</span><span class="n">males</span><span class="p">,</span> <span class="s1">'Marital Status'</span><span class="p">,</span> <span class="s1">'Employment Status'</span><span class="p">)</span>
</pre></div></div></div>
<div class="output_subarea output_stream output_stdout output_text">
<pre>Observation: 0.152735710118
Empirical P-value: 0.0
</pre></div>
<div class="output_png output_subarea ">
<img src="../notebooks-images/Permutation_51_1.png"/></div>
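<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>For readers who want to see the whole procedure without the <code>Table</code> class, here is a rough standard-library sketch of the same steps: compute the observed TVD, shuffle the values many times, and report the proportion of shuffles whose TVD is at least as large. The function names <code>tvd_between_groups</code> and <code>permutation_p_value</code>, the <code>seed</code> parameter, and the toy data are assumptions made for this illustration. This is not the implementation used above, and it does not draw a histogram.</p>

```python
import random
from collections import Counter

def tvd_between_groups(conditions, values):
    """TVD between the distributions of values in the two groups of conditions."""
    groups = sorted(set(conditions))
    if len(groups) != 2:
        raise ValueError("conditions must contain exactly two categories")
    counts = {g: Counter() for g in groups}
    for c, v in zip(conditions, values):
        counts[c][v] += 1
    totals = {g: sum(counts[g].values()) for g in groups}
    first, second = groups
    return 0.5 * sum(
        abs(counts[first][k] / totals[first] - counts[second][k] / totals[second])
        for k in set(values)
    )

def permutation_p_value(conditions, values, repetitions=500, seed=0):
    """Return the observed TVD and the empirical P-value from random shuffles."""
    rng = random.Random(seed)
    observed = tvd_between_groups(conditions, values)
    at_least = 0
    for _ in range(repetitions):
        # Shuffle the values while leaving the conditions in place
        shuffled = rng.sample(values, k=len(values))
        if tvd_between_groups(conditions, shuffled) >= observed:
            at_least += 1
    return observed, at_least / repetitions

# Toy data: the two groups have exactly the same mix of values
conditions = ['a'] * 30 + ['b'] * 30
balanced_values = ['x', 'y'] * 30
tvd_same, p_same = permutation_p_value(conditions, balanced_values)

# Toy data: the two groups share no values at all
separated_values = ['x'] * 30 + ['y'] * 30
tvd_apart, p_apart = permutation_p_value(conditions, separated_values)

print(tvd_same, p_same)
print(tvd_apart, p_apart)
```

<p>In the first toy data set the observed TVD is 0, so every shuffle is at least as extreme and the empirical P-value is 1. In the second the groups are completely separated, so the observed TVD is 1 and a random shuffle is extremely unlikely to match it.</p></div></div>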
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Now that we have generalized our permutation test, we can apply it to other hypotheses. For example, we can compare the distribution over the employment status of women, grouping them by their marital status. In the case of men we found a difference, but what about with women? First, we can visualize the two distributions.</p></div></div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="k">def</span> <span class="nf">compare_bar</span><span class="p">(</span><span class="n">t</span><span class="p">,</span> <span class="n">conditions</span><span class="p">,</span> <span class="n">values</span><span class="p">):</span>
    <span class="sd">"""Bar graphs of distributions of values for each of two conditions."""</span>
    <span class="n">e</span> <span class="o">=</span> <span class="n">t</span><span class="o">.</span><span class="n">pivot</span><span class="p">(</span><span class="n">conditions</span><span class="p">,</span> <span class="n">values</span><span class="p">)</span>
    <span class="k">for</span> <span class="n">label</span> <span class="ow">in</span> <span class="n">e</span><span class="o">.</span><span class="n">drop</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span><span class="o">.</span><span class="n">labels</span><span class="p">:</span>
        <span class="c1"># Convert each column of counts into proportions</span>
        <span class="n">e</span><span class="o">.</span><span class="n">append_column</span><span class="p">(</span><span class="n">label</span><span class="p">,</span> <span class="n">e</span><span class="o">.</span><span class="n">column</span><span class="p">(</span><span class="n">label</span><span class="p">)</span><span class="o">/</span><span class="nb">sum</span><span class="p">(</span><span class="n">e</span><span class="o">.</span><span class="n">column</span><span class="p">(</span><span class="n">label</span><span class="p">)))</span> 
    <span class="n">e</span><span class="o">.</span><span class="n">barh</span><span class="p">(</span><span class="n">values</span><span class="p">)</span>

<span class="n">compare_bar</span><span class="p">(</span><span class="n">couples</span><span class="o">.</span><span class="n">where</span><span class="p">(</span><span class="s1">'Gender'</span><span class="p">,</span> <span class="s1">'female'</span><span class="p">),</span> <span class="s1">'Marital Status'</span><span class="p">,</span> <span class="s1">'Employment Status'</span><span class="p">)</span>
</pre></div></div></div>
<div class="output_png output_subarea ">
<img src="../notebooks-images/Permutation_53_0.png"/></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>A glance at the figure shows that the two distributions are different in the sample. The difference in the category "Not working – other" is particularly striking: about 22% of the married women are in this category, compared to only about 8% of the unmarried women. There are several possible reasons for this difference. For example, the percent of homemakers is greater among married women than among unmarried women, possibly because married women are more likely to be "stay-at-home" mothers of young children. The difference could also be generational: as we saw earlier, the married couples are older than the unmarried partners, and older women are less likely to be in the workforce than younger women.</p></div></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>While we can see that the distributions are different in the sample, we are not really interested in the sample for its own sake. We are examining the sample because it is likely to reflect the population. So, as before, we will use the sample to try to answer a question about something unknown: the distributions of employment status of married and unmarried cohabiting women <em>in the United States</em>. That is the population from which the sample was drawn.</p>
<p>We have to consider the possibility that the observed difference in the sample could simply be the result of chance variation. Remember that our data are only from a random sample of couples. We do not have data for all the couples in the United States.</p></div></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p><strong>Null hypothesis</strong>: In the U.S., the distribution of employment status is the same for married women as for unmarried women living with their partners. The difference in the sample is due to chance.</p>
<p><strong>Alternative hypothesis</strong>: In the U.S., the distributions of employment status among married and unmarried cohabiting women are different.</p></div></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h3 id="Another-permutation-test-to-compare-distributions">Another permutation test to compare distributions<a class="anchor-link" href="#Another-permutation-test-to-compare-distributions">¶</a></h3><p>We can test these hypotheses just as we did for men, by using the function <code>permutation_tvd</code> that we defined for this purpose.</p></div></div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">permutation_tvd</span><span class="p">(</span><span class="n">couples</span><span class="o">.</span><span class="n">where</span><span class="p">(</span><span class="s1">'Gender'</span><span class="p">,</span> <span class="s1">'female'</span><span class="p">),</span> <span class="s1">'Marital Status'</span><span class="p">,</span> <span class="s1">'Employment Status'</span><span class="p">)</span>
</pre></div></div></div>
<div class="output_subarea output_stream output_stdout output_text">
<pre>Observation: 0.194755513565
Empirical P-value: 0.0
</pre></div>
<div class="output_png output_subarea ">
<img src="../notebooks-images/Permutation_58_1.png"/></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>As with the men, the empirical P-value is 0 based on a large number of repetitions. So the exact P-value is close to 0, which is evidence in favor of the alternative hypothesis. The data support the hypothesis that for women in the United States, employment status is associated with whether they are married or unmarried and living with their partners.</p></div></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h3 id="Another-example">Another example<a class="anchor-link" href="#Another-example">¶</a></h3><p>Are gender and employment status independent in the population? We are now in a position to test this quite swiftly:</p>
<p><strong>Null hypothesis.</strong> Among married and unmarried cohabiting individuals in the United States, gender is independent of employment status.</p>
<p><strong>Alternative hypothesis.</strong> Among married and unmarried cohabiting people in the United States, gender and employment status are related.</p></div></div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">permutation_tvd</span><span class="p">(</span><span class="n">couples</span><span class="p">,</span> <span class="s1">'Gender'</span><span class="p">,</span> <span class="s1">'Employment Status'</span><span class="p">)</span>
</pre></div></div></div>
<div class="output_subarea output_stream output_stdout output_text">
<pre>Observation: 0.185866408519
Empirical P-value: 0.0
</pre></div>
<div class="output_png output_subarea ">
<img src="../notebooks-images/Permutation_61_1.png"/></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>The conclusion of the test is that gender and employment status are not independent in the population. This is no surprise; for example, because of societal norms, older women were less likely to have gone into the workforce than men.</p></div></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h3 id="Deflategate:-Permutation-Tests-and-Quantitative-Variables">Deflategate: Permutation Tests and Quantitative Variables<a class="anchor-link" href="#Deflategate:-Permutation-Tests-and-Quantitative-Variables">¶</a></h3><p>On January 18, 2015, the Indianapolis Colts and the New England Patriots played the American Football Conference (AFC) championship game to determine which of those teams would play in the Super Bowl. After the game, there were allegations that the Patriots' footballs had not been inflated as much as the regulations required; they were softer. This could be an advantage, as softer balls might be easier to catch.</p>
<p>For several weeks, the world of American football was consumed by accusations, denials, theories, and suspicions: the press labeled the topic Deflategate, after the Watergate political scandal of the 1970s. The National Football League (NFL) commissioned an independent analysis. In this example, we will perform our own analysis of the data.</p>
<p>Pressure is often measured in pounds per square inch (psi). NFL rules stipulate that game balls must be inflated to have pressures in the range 12.5 psi to 13.5 psi. Each team plays with 12 balls. Teams have the responsibility of maintaining the pressure in their own footballs, but game officials inspect the balls. Before the start of the AFC game, all the Patriots' balls were at about 12.5 psi. Most of the Colts' balls were at about 13.0 psi. However, these pre-game data were not recorded.</p>
<p>During the second quarter, the Colts intercepted a Patriots ball. On the sidelines, they measured the pressure of the ball and determined that it was below the 12.5 psi threshold. Promptly, they informed officials.</p>
<p>At half-time, all the game balls were collected for inspection. Two officials, Clete Blakeman and Dyrol Prioleau, measured the pressure in each of the balls. Here are the data; pressure is measured in psi. The Patriots ball that had been intercepted by the Colts was not inspected at half-time. Nor were most of the Colts' balls – the officials simply ran out of time and had to relinquish the balls for the start of second half play.</p></div></div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">football</span> <span class="o">=</span> <span class="n">Table</span><span class="o">.</span><span class="n">read_table</span><span class="p">(</span><span class="s1">'football.csv'</span><span class="p">)</span>
<span class="n">football</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
</pre></div></div></div>
<div class="output_html rendered_html output_subarea ">
<table border="1" class="dataframe">
    <thead>
        <tr>
            <th>Team</th> <th>Ball</th> <th>Blakeman</th> <th>Prioleau</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td>0   </td> <td>Patriots 1 </td> <td>11.5    </td> <td>11.8    </td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>0   </td> <td>Patriots 2 </td> <td>10.85   </td> <td>11.2    </td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>0   </td> <td>Patriots 3 </td> <td>11.15   </td> <td>11.5    </td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>0   </td> <td>Patriots 4 </td> <td>10.7    </td> <td>11      </td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>0   </td> <td>Patriots 5 </td> <td>11.1    </td> <td>11.45   </td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>0   </td> <td>Patriots 6 </td> <td>11.6    </td> <td>11.95   </td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>0   </td> <td>Patriots 7 </td> <td>11.85   </td> <td>12.3    </td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>0   </td> <td>Patriots 8 </td> <td>11.1    </td> <td>11.55   </td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>0   </td> <td>Patriots 9 </td> <td>10.95   </td> <td>11.35   </td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>0   </td> <td>Patriots 10</td> <td>10.5    </td> <td>10.9    </td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>0   </td> <td>Patriots 11</td> <td>10.9    </td> <td>11.35   </td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>1   </td> <td>Colts 1    </td> <td>12.7    </td> <td>12.35   </td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>1   </td> <td>Colts 2    </td> <td>12.75   </td> <td>12.3    </td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>1   </td> <td>Colts 3    </td> <td>12.5    </td> <td>12.95   </td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>1   </td> <td>Colts 4    </td> <td>12.55   </td> <td>12.15   </td>
        </tr>
    </tbody>
</table></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>For each of the 15 balls that were inspected, the two officials got different results. It is not uncommon that repeated measurements on the same object yield different results, especially when the measurements are performed by different people. So we will assign to each ball the average of the two measurements made on that ball.</p></div></div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">football</span> <span class="o">=</span> <span class="n">football</span><span class="o">.</span><span class="n">with_column</span><span class="p">(</span>
    <span class="s1">'Combined'</span><span class="p">,</span> <span class="p">(</span><span class="n">football</span><span class="o">.</span><span class="n">column</span><span class="p">(</span><span class="s1">'Blakeman'</span><span class="p">)</span><span class="o">+</span><span class="n">football</span><span class="o">.</span><span class="n">column</span><span class="p">(</span><span class="s1">'Prioleau'</span><span class="p">))</span><span class="o">/</span><span class="mi">2</span>
    <span class="p">)</span>
<span class="n">football</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
</pre></div></div></div>
<div class="output_html rendered_html output_subarea ">
<table border="1" class="dataframe">
    <thead>
        <tr>
            <th>Team</th> <th>Ball</th> <th>Blakeman</th> <th>Prioleau</th> <th>Combined</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td>0   </td> <td>Patriots 1 </td> <td>11.5    </td> <td>11.8    </td> <td>11.65   </td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>0   </td> <td>Patriots 2 </td> <td>10.85   </td> <td>11.2    </td> <td>11.025  </td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>0   </td> <td>Patriots 3 </td> <td>11.15   </td> <td>11.5    </td> <td>11.325  </td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>0   </td> <td>Patriots 4 </td> <td>10.7    </td> <td>11      </td> <td>10.85   </td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>0   </td> <td>Patriots 5 </td> <td>11.1    </td> <td>11.45   </td> <td>11.275  </td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>0   </td> <td>Patriots 6 </td> <td>11.6    </td> <td>11.95   </td> <td>11.775  </td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>0   </td> <td>Patriots 7 </td> <td>11.85   </td> <td>12.3    </td> <td>12.075  </td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>0   </td> <td>Patriots 8 </td> <td>11.1    </td> <td>11.55   </td> <td>11.325  </td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>0   </td> <td>Patriots 9 </td> <td>10.95   </td> <td>11.35   </td> <td>11.15   </td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>0   </td> <td>Patriots 10</td> <td>10.5    </td> <td>10.9    </td> <td>10.7    </td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>0   </td> <td>Patriots 11</td> <td>10.9    </td> <td>11.35   </td> <td>11.125  </td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>1   </td> <td>Colts 1    </td> <td>12.7    </td> <td>12.35   </td> <td>12.525  </td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>1   </td> <td>Colts 2    </td> <td>12.75   </td> <td>12.3    </td> <td>12.525  </td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>1   </td> <td>Colts 3    </td> <td>12.5    </td> <td>12.95   </td> <td>12.725  </td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>1   </td> <td>Colts 4    </td> <td>12.55   </td> <td>12.15   </td> <td>12.35   </td>
        </tr>
    </tbody>
</table></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>At a glance, it seems apparent that the Patriots' footballs were at a lower pressure than the Colts' balls. Because some deflation is normal during the course of a game, the independent analysts decided to calculate the drop in pressure from the start of the game. Recall that the Patriots' balls had all started out at about 12.5 psi, and the Colts' balls at about 13.0 psi. Therefore the drop in pressure for the Patriots' balls was computed as 12.5 minus the pressure at half-time, and the drop in pressure for the Colts' balls was 13.0 minus the pressure at half-time.</p></div></div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">football</span> <span class="o">=</span> <span class="n">football</span><span class="o">.</span><span class="n">with_column</span><span class="p">(</span>
    <span class="s1">'Drop'</span><span class="p">,</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">([</span><span class="mf">12.5</span><span class="p">]</span><span class="o">*</span><span class="mi">11</span> <span class="o">+</span> <span class="p">[</span><span class="mf">13.0</span><span class="p">]</span><span class="o">*</span><span class="mi">4</span><span class="p">)</span> <span class="o">-</span> <span class="n">football</span><span class="o">.</span><span class="n">column</span><span class="p">(</span><span class="s1">'Combined'</span><span class="p">)</span>
    <span class="p">)</span>
<span class="n">football</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
</pre></div></div></div>
<div class="output_html rendered_html output_subarea ">
<table border="1" class="dataframe">
    <thead>
        <tr>
            <th>Team</th> <th>Ball</th> <th>Blakeman</th> <th>Prioleau</th> <th>Combined</th> <th>Drop</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td>0   </td> <td>Patriots 1 </td> <td>11.5    </td> <td>11.8    </td> <td>11.65   </td> <td>0.85 </td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>0   </td> <td>Patriots 2 </td> <td>10.85   </td> <td>11.2    </td> <td>11.025  </td> <td>1.475</td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>0   </td> <td>Patriots 3 </td> <td>11.15   </td> <td>11.5    </td> <td>11.325  </td> <td>1.175</td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>0   </td> <td>Patriots 4 </td> <td>10.7    </td> <td>11      </td> <td>10.85   </td> <td>1.65 </td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>0   </td> <td>Patriots 5 </td> <td>11.1    </td> <td>11.45   </td> <td>11.275  </td> <td>1.225</td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>0   </td> <td>Patriots 6 </td> <td>11.6    </td> <td>11.95   </td> <td>11.775  </td> <td>0.725</td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>0   </td> <td>Patriots 7 </td> <td>11.85   </td> <td>12.3    </td> <td>12.075  </td> <td>0.425</td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>0   </td> <td>Patriots 8 </td> <td>11.1    </td> <td>11.55   </td> <td>11.325  </td> <td>1.175</td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>0   </td> <td>Patriots 9 </td> <td>10.95   </td> <td>11.35   </td> <td>11.15   </td> <td>1.35 </td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>0   </td> <td>Patriots 10</td> <td>10.5    </td> <td>10.9    </td> <td>10.7    </td> <td>1.8  </td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>0   </td> <td>Patriots 11</td> <td>10.9    </td> <td>11.35   </td> <td>11.125  </td> <td>1.375</td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>1   </td> <td>Colts 1    </td> <td>12.7    </td> <td>12.35   </td> <td>12.525  </td> <td>0.475</td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>1   </td> <td>Colts 2    </td> <td>12.75   </td> <td>12.3    </td> <td>12.525  </td> <td>0.475</td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>1   </td> <td>Colts 3    </td> <td>12.5    </td> <td>12.95   </td> <td>12.725  </td> <td>0.275</td>
        </tr>
    </tbody>
        <tbody><tr>
            <td>1   </td> <td>Colts 4    </td> <td>12.55   </td> <td>12.15   </td> <td>12.35   </td> <td>0.65 </td>
        </tr>
    </tbody>
</table></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>It is apparent that the drop was larger, on average, for the Patriots' footballs. But could the difference be just due to chance?</p>
<p>To answer this, we must first examine how chance might enter the analysis. This is not a situation in which there is a random sample of data from a large population. It is also not clear how to create a justifiable abstract chance model, as the balls were all different, inflated by different people, and maintained under different conditions.</p>
<p>One way to introduce chance is to ask whether the drops in pressures of the 11 Patriots balls and the 4 Colts balls resemble a random permutation of the 15 drops. Then the 4 Colts drops would be a simple random sample of all 15 drops. This gives us a null hypothesis that we can test using random permutations.</p>
<p><strong>Null hypothesis.</strong> The drops in the pressures of the 4 Colts balls are like a random sample (without replacement) from all 15 drops.</p></div></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h4 id="A-new-test-statistic">A new test statistic<a class="anchor-link" href="#A-new-test-statistic">¶</a></h4><p>The data are quantitative, so we cannot compare the two distributions category by category using the total variation distance. If we try to bin the data in order to use the TVD, the choice of bins can have a noticeable effect on the statistic. So instead, we will work with a simple statistic based on means. We will just compare the average drops in the two groups.</p>
<p>The observed difference between the average drops in the two groups was about 0.7335 psi.</p></div></div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">patriots</span> <span class="o">=</span> <span class="n">football</span><span class="o">.</span><span class="n">where</span><span class="p">(</span><span class="s1">'Team'</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span><span class="o">.</span><span class="n">column</span><span class="p">(</span><span class="s1">'Drop'</span><span class="p">)</span>
<span class="n">colts</span> <span class="o">=</span> <span class="n">football</span><span class="o">.</span><span class="n">where</span><span class="p">(</span><span class="s1">'Team'</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span><span class="o">.</span><span class="n">column</span><span class="p">(</span><span class="s1">'Drop'</span><span class="p">)</span>
<span class="n">observed_difference</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="n">patriots</span><span class="p">)</span> <span class="o">-</span> <span class="n">np</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="n">colts</span><span class="p">)</span>
<span class="n">observed_difference</span>
</pre></div></div></div>
<div class="output_text output_subarea output_execute_result">
<pre>0.73352272727272805</pre></div>
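<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>To illustrate the earlier remark about binning, the following sketch computes a binned TVD for the same drops under two different choices of bins. The helper <code>binned_tvd</code> and both sets of bin edges are assumptions chosen for illustration; the drops are copied from the table above.</p></div></div>

```python
import numpy as np

# Drops copied from the table above
patriots_drops = np.array([0.85, 1.475, 1.175, 1.65, 1.225, 0.725,
                           0.425, 1.175, 1.35, 1.8, 1.375])
colts_drops = np.array([0.475, 0.475, 0.275, 0.65])

def binned_tvd(a, b, bin_edges):
    """Hypothetical helper: TVD between two samples after binning both
    with the same edges."""
    prop_a = np.histogram(a, bins=bin_edges)[0] / len(a)
    prop_b = np.histogram(b, bins=bin_edges)[0] / len(b)
    return 0.5 * np.sum(np.abs(prop_a - prop_b))

# Two arbitrary but reasonable-looking choices of bins
tvd_half_psi = binned_tvd(patriots_drops, colts_drops,
                          np.array([0.0, 0.5, 1.0, 1.5, 2.0]))
tvd_wide = binned_tvd(patriots_drops, colts_drops,
                      np.array([0.0, 0.7, 1.4, 2.1]))

print("TVD with 0.5 psi bins:", tvd_half_psi)
print("TVD with 0.7 psi bins:", tvd_wide)
```

<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>With these particular edges the statistic is about 0.73 for the first set of bins and about 0.91 for the second, even though the data are unchanged. A difference in means involves no such choice.</p></div></div>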
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Now the question becomes: If we took a random permutation of the 15 drops, how likely is it that the difference in the means of the first 11 and the last 4 would be at least as large as the difference observed by the officials?</p></div></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>To answer this, we will randomly permute the 15 drops, assign the first 11 permuted values to the Patriots and the last 4 to the Colts. Then we will find the difference in the means of the two permuted groups.</p></div></div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="c1"># Shuffle all 15 drops; the first 11 go to the Patriots, the last 4 to the Colts</span>
<span class="n">drops_shuffled</span> <span class="o">=</span> <span class="n">football</span><span class="o">.</span><span class="n">sample</span><span class="p">(</span><span class="n">with_replacement</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span><span class="o">.</span><span class="n">column</span><span class="p">(</span><span class="s1">'Drop'</span><span class="p">)</span>
<span class="n">patriots_shuffled</span> <span class="o">=</span> <span class="n">drops_shuffled</span><span class="p">[:</span><span class="mi">11</span><span class="p">]</span>
<span class="n">colts_shuffled</span> <span class="o">=</span> <span class="n">drops_shuffled</span><span class="p">[</span><span class="mi">11</span><span class="p">:]</span>
<span class="n">shuffled_difference</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="n">patriots_shuffled</span><span class="p">)</span> <span class="o">-</span> <span class="n">np</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="n">colts_shuffled</span><span class="p">)</span>
<span class="n">shuffled_difference</span>
</pre></div></div></div>
<div class="output_text output_subarea output_execute_result">
<pre>0.010000000000000675</pre></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>This is different from the observed value we calculated earlier. But to get a better sense of the variability under random sampling we must repeat the process many times. Let us try making 5000 repetitions and drawing a histogram of the 5000 differences between means.</p></div></div>
<div class="inner_cell">
    <div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">repetitions</span> <span class="o">=</span> <span class="mi">5000</span>

<span class="n">test_stats</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">repetitions</span><span class="p">):</span>
    <span class="n">drops_shuffled</span> <span class="o">=</span> <span class="n">football</span><span class="o">.</span><span class="n">sample</span><span class="p">(</span><span class="n">with_replacement</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span><span class="o">.</span><span class="n">column</span><span class="p">(</span><span class="s1">'Drop'</span><span class="p">)</span>
    <span class="n">patriots_shuffled</span> <span class="o">=</span> <span class="n">drops_shuffled</span><span class="p">[:</span><span class="mi">11</span><span class="p">]</span>
    <span class="n">colts_shuffled</span> <span class="o">=</span> <span class="n">drops_shuffled</span><span class="p">[</span><span class="mi">11</span><span class="p">:]</span>
    <span class="n">shuffled_difference</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="n">patriots_shuffled</span><span class="p">)</span> <span class="o">-</span> <span class="n">np</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="n">colts_shuffled</span><span class="p">)</span>
    <span class="n">test_stats</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">shuffled_difference</span><span class="p">)</span>
    
<span class="n">observation</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="n">patriots</span><span class="p">)</span> <span class="o">-</span> <span class="n">np</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="n">colts</span><span class="p">)</span>
<span class="n">emp_p_value</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">count_nonzero</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="n">test_stats</span><span class="p">)</span> <span class="o">&gt;=</span> <span class="n">observation</span><span class="p">)</span> <span class="o">/</span> <span class="n">repetitions</span>

<span class="n">differences</span> <span class="o">=</span> <span class="n">Table</span><span class="p">()</span><span class="o">.</span><span class="n">with_column</span><span class="p">(</span><span class="s1">'Difference in Means'</span><span class="p">,</span> <span class="n">test_stats</span><span class="p">)</span>
<span class="n">differences</span><span class="o">.</span><span class="n">hist</span><span class="p">(</span><span class="n">bins</span><span class="o">=</span><span class="mi">20</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="s2">"Observation:"</span><span class="p">,</span> <span class="n">observation</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="s2">"Empirical P-value:"</span><span class="p">,</span> <span class="n">emp_p_value</span><span class="p">)</span>
</pre></div></div></div>
<div class="output_subarea output_stream output_stdout output_text">
<pre>Observation: 0.733522727273
Empirical P-value: 0.0028
</pre></div>
<div class="output_png output_subarea ">
<img src="../notebooks-images/Permutation_76_1.png"/></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>The observed difference was roughly 0.7335 psi. According to the empirical distribution above, there is a very small chance that a random permutation would yield a difference that large. So the data support the conclusion that the two groups of pressure drops were not like a random permutation of all 15 drops.</p></div></div>
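<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Because there are only 15 drops, the simulation can be cross-checked by listing every way of choosing which 4 drops are assigned to the Colts; there are 1365 such choices. The sketch below is not part of the original analysis. It uses plain NumPy and <code>itertools</code> instead of the <code>Table</code> methods above, the variable names are assumptions, and a small tolerance is used so that floating-point rounding does not exclude the observed split itself.</p></div></div>

```python
from itertools import combinations

import numpy as np

# Drops copied from the table above: 11 Patriots balls, then 4 Colts balls
drops = np.array([0.85, 1.475, 1.175, 1.65, 1.225, 0.725, 0.425, 1.175,
                  1.35, 1.8, 1.375, 0.475, 0.475, 0.275, 0.65])
n_colts = 4

observed = drops[:11].mean() - drops[11:].mean()

count = 0   # splits with a difference at least as large as the observed one
total = 0   # all possible splits
for colts_idx in combinations(range(len(drops)), n_colts):
    is_colts = np.zeros(len(drops), dtype=bool)
    is_colts[list(colts_idx)] = True
    diff = drops[~is_colts].mean() - drops[is_colts].mean()
    total += 1
    if diff >= observed - 1e-12:   # tolerance for floating-point rounding
        count += 1

exact_p_value = count / total
print("Number of splits:", total)
print("Exact P-value:", exact_p_value)
```

<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>With the values in the table, 4 of the 1365 splits give a difference at least as large as the observed one, so the exact P-value is about 0.003, in line with the empirical P-value above.</p></div></div>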
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>The independent investigative team analyzed the data in several different ways, taking into account the laws of physics. The final report said,</p>
<p>"[T]he average pressure drop of the Patriots game balls exceeded the average pressure drop of the Colts balls by 0.45 to 1.02 psi, depending on various possible assumptions regarding the gauges used, and assuming an initial pressure of 12.5 psi for the Patriots balls and 13.0 for the Colts balls."</p>
<p>-- <em>Investigative report commissioned by the NFL regarding the AFC Championship game on January 18, 2015</em></p>
<p>Our analysis shows that the average pressure drop of the Patriots balls exceeded that of the Colts balls by about 0.73 psi, which falls within the range given in the official analysis.</p>
<p>The all-important question in the football world was whether the excess drop of pressure in the Patriots' footballs was deliberate. To that question, the data have no answer. If you are curious about the answer given by the investigators, here is the <a href="https://nfllabor.files.wordpress.com/2015/05/investigative-and-expert-reports-re-footballs-used-during-afc-championsh.pdf">full report</a>.</p></div></div>
</div>