<div id="ipython-notebook">
<a class="interact-button" href="http://data8.berkeley.edu/hub/interact?repo=textbook&path=notebooks/couples.csv&path=notebooks/football.csv&path=notebooks/Permutation.ipynb">Interact</a>
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
tex2jax: {
inlineMath: [['$','$']],
processEscapes: true
}
});
</script>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h2 id="Comparing-Two-Groups">Comparing Two Groups<a class="anchor-link" href="#Comparing-Two-Groups">¶</a></h2><p>In the examples above, we investigated whether a sample appears to be chosen randomly from an underlying population. We did this by comparing the distribution of the sample with the distribution of the population. A similar line of reasoning can be used to compare the distributions of two samples. In particular, we can investigate whether or not two samples appear to be drawn from the same underlying distribution.</p></div></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h3 id="Example:-Married-Couples-and-Unmarried-Partners">Example: Married Couples and Unmarried Partners<a class="anchor-link" href="#Example:-Married-Couples-and-Unmarried-Partners">¶</a></h3><p>Our next example is based on a study conducted in 2010 under the auspices of the National Center for Family and Marriage Research.</p>
<p>In the United States, the proportion of couples who live together but are not married has been rising in recent decades. The study involved a national random sample of over 1,000 heterosexual couples who were either married or "cohabiting partners" – living together but unmarried. One of the goals of the study was to compare the attitudes and experiences of the married and unmarried couples.</p>
<p>The table below shows a subset of the data collected in the study. Each row corresponds to one person. The variables that we will examine in this section are:</p>
<ul>
<li>Marital Status: married or unmarried</li>
<li>Employment Status: one of several categories described below</li>
<li>Gender</li>
<li>Age: Age in years</li>
</ul></div></div>
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">columns</span> <span class="o">=</span> <span class="p">[</span><span class="s1">'Marital Status'</span><span class="p">,</span> <span class="s1">'Employment Status'</span><span class="p">,</span> <span class="s1">'Gender'</span><span class="p">,</span> <span class="s1">'Age'</span><span class="p">,</span> <span class="p">]</span>
<span class="n">couples</span> <span class="o">=</span> <span class="n">Table</span><span class="o">.</span><span class="n">read_table</span><span class="p">(</span><span class="s1">'couples.csv'</span><span class="p">)</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">columns</span><span class="p">)</span>
<span class="n">couples</span>
</pre></div></div></div>
<div class="output_html rendered_html output_subarea output_execute_result">
<table border="1" class="dataframe">
<thead>
<tr>
<th>Marital Status</th> <th>Employment Status</th> <th>Gender</th> <th>Age</th>
</tr>
</thead>
<tbody>
<tr>
<td>married </td> <td>working as paid employee</td> <td>male </td> <td>51 </td>
</tr>
</tbody>
<tbody><tr>
<td>married </td> <td>working as paid employee</td> <td>female</td> <td>53 </td>
</tr>
</tbody>
<tbody><tr>
<td>married </td> <td>working as paid employee</td> <td>male </td> <td>57 </td>
</tr>
</tbody>
<tbody><tr>
<td>married </td> <td>working as paid employee</td> <td>female</td> <td>57 </td>
</tr>
</tbody>
<tbody><tr>
<td>married </td> <td>working as paid employee</td> <td>male </td> <td>60 </td>
</tr>
</tbody>
<tbody><tr>
<td>married </td> <td>working as paid employee</td> <td>female</td> <td>57 </td>
</tr>
</tbody>
<tbody><tr>
<td>married </td> <td>working, self-employed </td> <td>male </td> <td>62 </td>
</tr>
</tbody>
<tbody><tr>
<td>married </td> <td>working as paid employee</td> <td>female</td> <td>59 </td>
</tr>
</tbody>
<tbody><tr>
<td>married </td> <td>not working - other </td> <td>male </td> <td>53 </td>
</tr>
</tbody>
<tbody><tr>
<td>married </td> <td>not working - retired </td> <td>female</td> <td>61 </td>
</tr>
</tbody>
</table>
<p>... (2056 rows omitted)</p></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Let us consider just the males first. There are 742 married couples and 291 unmarried couples, and all couples in this study had one male and one female, making 1,033 males in all.</p></div></div>
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="c1"># Separate tables for married and cohabiting unmarried couples:</span>
<span class="n">married_men</span> <span class="o">=</span> <span class="n">couples</span><span class="o">.</span><span class="n">where</span><span class="p">(</span><span class="s1">'Gender'</span><span class="p">,</span> <span class="s1">'male'</span><span class="p">)</span><span class="o">.</span><span class="n">where</span><span class="p">(</span><span class="s1">'Marital Status'</span><span class="p">,</span> <span class="s1">'married'</span><span class="p">)</span>
<span class="n">partnered_men</span> <span class="o">=</span> <span class="n">couples</span><span class="o">.</span><span class="n">where</span><span class="p">(</span><span class="s1">'Gender'</span><span class="p">,</span> <span class="s1">'male'</span><span class="p">)</span><span class="o">.</span><span class="n">where</span><span class="p">(</span><span class="s1">'Marital Status'</span><span class="p">,</span> <span class="s1">'partner'</span><span class="p">)</span>
<span class="c1"># Let's see how many married and unmarried men there are:</span>
<span class="n">couples</span><span class="o">.</span><span class="n">where</span><span class="p">(</span><span class="s1">'Gender'</span><span class="p">,</span> <span class="s1">'male'</span><span class="p">)</span><span class="o">.</span><span class="n">group</span><span class="p">(</span><span class="s1">'Marital Status'</span><span class="p">)</span><span class="o">.</span><span class="n">barh</span><span class="p">(</span><span class="s2">"Marital Status"</span><span class="p">)</span>
</pre></div></div></div>
<div class="output_png output_subarea ">
<img src="../notebooks-images/Permutation_5_0.png"/></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Societal norms have changed over the decades, and there has been a gradual acceptance of couples living together without being married. Thus it is natural to expect that unmarried couples will in general consist of younger people than married couples.</p>
<p>The histograms of the ages of the married and unmarried men show that this is indeed the case. We will draw these histograms and compare them. In order to compare two histograms, both should be drawn to the same scale. Let us write a function that does this for us.</p></div></div>
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="k">def</span> <span class="nf">plot_age</span><span class="p">(</span><span class="n">table</span><span class="p">,</span> <span class="n">subject</span><span class="p">):</span>
<span class="sd">"""</span>
<span class="sd"> Draws a histogram of the Age column in the given table.</span>
<span class="sd"> </span>
<span class="sd"> table should be a Table with a column of people's ages called Age.</span>
<span class="sd"> </span>
<span class="sd"> subject should be a string -- the name of the group we're displaying,</span>
<span class="sd"> like "married men".</span>
<span class="sd"> """</span>
<span class="c1"># Draw a histogram of ages running from 15 years to 70 years</span>
<span class="n">table</span><span class="o">.</span><span class="n">hist</span><span class="p">(</span><span class="s1">'Age'</span><span class="p">,</span> <span class="n">bins</span><span class="o">=</span><span class="n">np</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">15</span><span class="p">,</span> <span class="mi">71</span><span class="p">,</span> <span class="mi">5</span><span class="p">),</span> <span class="n">unit</span><span class="o">=</span><span class="s1">'year'</span><span class="p">)</span>
<span class="c1"># Set the lower and upper bounds of the vertical axis so that</span>
<span class="c1"># the plots we make are all comparable.</span>
<span class="n">plt</span><span class="o">.</span><span class="n">ylim</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mf">0.045</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">title</span><span class="p">(</span><span class="s2">"Ages of "</span> <span class="o">+</span> <span class="n">subject</span><span class="p">)</span>
</pre></div></div></div>
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="c1"># Ages of men:</span>
<span class="n">plot_age</span><span class="p">(</span><span class="n">married_men</span><span class="p">,</span> <span class="s2">"married men"</span><span class="p">)</span>
<span class="n">plot_age</span><span class="p">(</span><span class="n">partnered_men</span><span class="p">,</span> <span class="s2">"cohabiting unmarried men"</span><span class="p">)</span>
</pre></div></div></div>
<div class="output_png output_subarea ">
<img src="../notebooks-images/Permutation_8_0.png"/></div>
<div class="output_png output_subarea ">
<img src="../notebooks-images/Permutation_8_1.png"/></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>The difference is even more marked when we compare the married and unmarried women.</p></div></div>
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">married_women</span> <span class="o">=</span> <span class="n">couples</span><span class="o">.</span><span class="n">where</span><span class="p">(</span><span class="s1">'Gender'</span><span class="p">,</span> <span class="s1">'female'</span><span class="p">)</span><span class="o">.</span><span class="n">where</span><span class="p">(</span><span class="s1">'Marital Status'</span><span class="p">,</span> <span class="s1">'married'</span><span class="p">)</span>
<span class="n">partnered_women</span> <span class="o">=</span> <span class="n">couples</span><span class="o">.</span><span class="n">where</span><span class="p">(</span><span class="s1">'Gender'</span><span class="p">,</span> <span class="s1">'female'</span><span class="p">)</span><span class="o">.</span><span class="n">where</span><span class="p">(</span><span class="s1">'Marital Status'</span><span class="p">,</span> <span class="s1">'partner'</span><span class="p">)</span>
<span class="c1"># Ages of women:</span>
<span class="n">plot_age</span><span class="p">(</span><span class="n">married_women</span><span class="p">,</span> <span class="s2">"married women"</span><span class="p">)</span>
<span class="n">plot_age</span><span class="p">(</span><span class="n">partnered_women</span><span class="p">,</span> <span class="s2">"cohabiting unmarried women"</span><span class="p">)</span>
</pre></div></div></div>
<div class="output_png output_subarea ">
<img src="../notebooks-images/Permutation_10_0.png"/></div>
<div class="output_png output_subarea ">
<img src="../notebooks-images/Permutation_10_1.png"/></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>The histograms show that the married men in the sample are in general older than unmarried cohabiting men. Married women are in general older than unmarried women. These observations are consistent with what we had predicted based on changing social norms.</p>
<p>If married couples are in general older, they might differ from unmarried couples in other ways as well. Let us compare the employment status of the married and unmarried men in the sample.</p>
<p>The table below shows the marital status and employment status of each man in the sample.</p></div></div>
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">males</span> <span class="o">=</span> <span class="n">couples</span><span class="o">.</span><span class="n">where</span><span class="p">(</span><span class="s1">'Gender'</span><span class="p">,</span> <span class="s1">'male'</span><span class="p">)</span><span class="o">.</span><span class="n">select</span><span class="p">([</span><span class="s1">'Marital Status'</span><span class="p">,</span> <span class="s1">'Employment Status'</span><span class="p">])</span>
<span class="n">males</span><span class="o">.</span><span class="n">show</span><span class="p">(</span><span class="mi">5</span><span class="p">)</span>
</pre></div></div></div>
<div class="output_html rendered_html output_subarea ">
<table border="1" class="dataframe">
<thead>
<tr>
<th>Marital Status</th> <th>Employment Status</th>
</tr>
</thead>
<tbody>
<tr>
<td>married </td> <td>working as paid employee</td>
</tr>
</tbody>
<tbody><tr>
<td>married </td> <td>working as paid employee</td>
</tr>
</tbody>
<tbody><tr>
<td>married </td> <td>working as paid employee</td>
</tr>
</tbody>
<tbody><tr>
<td>married </td> <td>working, self-employed </td>
</tr>
</tbody>
<tbody><tr>
<td>married </td> <td>not working - other </td>
</tr>
</tbody>
</table>
<p>... (1028 rows omitted)</p></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h3 id="Contingency-Tables">Contingency Tables<a class="anchor-link" href="#Contingency-Tables">¶</a></h3></div></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>To investigate the association between employment and marriage, we would like to be able to ask questions like, "How many married men are retired?"</p>
<p>Recall that the method <code>pivot</code> lets us do exactly that. It <em>cross-classifies</em> each man according to the two variables – marital status and employment status. Its output is a <em>contingency table</em> that contains the counts in each pair of categories.</p></div></div>
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">employment</span> <span class="o">=</span> <span class="n">males</span><span class="o">.</span><span class="n">pivot</span><span class="p">(</span><span class="s1">'Marital Status'</span><span class="p">,</span> <span class="s1">'Employment Status'</span><span class="p">)</span>
<span class="n">employment</span>
</pre></div></div></div>
<div class="output_html rendered_html output_subarea output_execute_result">
<table border="1" class="dataframe">
<thead>
<tr>
<th>Employment Status</th> <th>married</th> <th>partner</th>
</tr>
</thead>
<tbody>
<tr>
<td>not working - disabled </td> <td>44 </td> <td>20 </td>
</tr>
</tbody>
<tbody><tr>
<td>not working - looking for work </td> <td>28 </td> <td>33 </td>
</tr>
</tbody>
<tbody><tr>
<td>not working - on a temporary layoff from a job</td> <td>15 </td> <td>8 </td>
</tr>
</tbody>
<tbody><tr>
<td>not working - other </td> <td>16 </td> <td>9 </td>
</tr>
</tbody>
<tbody><tr>
<td>not working - retired </td> <td>44 </td> <td>4 </td>
</tr>
</tbody>
<tbody><tr>
<td>working as paid employee </td> <td>513 </td> <td>170 </td>
</tr>
</tbody>
<tbody><tr>
<td>working, self-employed </td> <td>82 </td> <td>47 </td>
</tr>
</tbody>
</table></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>The arguments of <code>pivot</code> are the labels of the two columns corresponding to the variables we are studying. Categories of the first argument appear as columns; categories of the second argument are the rows. Each cell of the table contains the number of men in a pair of categories – a particular employment status and a particular marital status.</p></div></div>
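<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>To make the cross-classification concrete, here is a minimal sketch in plain Python of the kind of counting that a contingency table summarizes. It is not how <code>pivot</code> is implemented; the six records and the names <code>records</code>, <code>pair_counts</code>, and <code>contingency</code> are hypothetical and chosen only for illustration.</p></div></div>
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre>
```python
from collections import Counter

# Hypothetical toy records: (marital status, employment status) for six men.
records = [
    ("married", "working"),
    ("married", "retired"),
    ("partner", "working"),
    ("married", "working"),
    ("partner", "looking"),
    ("partner", "working"),
]

# Count how many men fall in each (marital status, employment status) pair.
pair_counts = Counter(records)

# Lay the counts out with employment categories as rows and
# marital categories as columns, mirroring the layout described above.
marital_levels = sorted({marital for marital, _ in records})
employment_levels = sorted({employment for _, employment in records})
contingency = {
    employment: [pair_counts[(marital, employment)] for marital in marital_levels]
    for employment in employment_levels
}

print("columns:", marital_levels)
for employment in employment_levels:
    print(employment, contingency[employment])
```
</pre></div></div></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>A pair that never occurs in the data gets a count of 0, because a <code>Counter</code> returns 0 for a missing key.</p></div></div>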
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>The table shows that regardless of marital status, the men in the sample are most likely to be working as paid employees. But it is quite hard to compare the entire distributions based on this table, because the numbers of married and unmarried men in the sample are not the same. There are 742 married men but only 291 unmarried ones.</p></div></div>
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">employment</span><span class="o">.</span><span class="n">drop</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span><span class="o">.</span><span class="n">sum</span><span class="p">()</span>
</pre></div></div></div>
<div class="output_html rendered_html output_subarea output_execute_result">
<table border="1" class="dataframe">
<thead>
<tr>
<th>married</th> <th>partner</th>
</tr>
</thead>
<tbody>
<tr>
<td>742 </td> <td>291 </td>
</tr>
</tbody>
</table></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>To adjust for this difference in total numbers, we will convert the counts into proportions, by dividing all the <code>married</code> counts by 742 and all the <code>partner</code> counts by 291.</p></div></div>
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">proportions</span> <span class="o">=</span> <span class="n">Table</span><span class="p">()</span><span class="o">.</span><span class="n">with_columns</span><span class="p">([</span>
<span class="s2">"Employment Status"</span><span class="p">,</span> <span class="n">employment</span><span class="o">.</span><span class="n">column</span><span class="p">(</span><span class="s2">"Employment Status"</span><span class="p">),</span>
<span class="s2">"married"</span><span class="p">,</span> <span class="n">employment</span><span class="o">.</span><span class="n">column</span><span class="p">(</span><span class="s1">'married'</span><span class="p">)</span><span class="o">/</span><span class="nb">sum</span><span class="p">(</span><span class="n">employment</span><span class="o">.</span><span class="n">column</span><span class="p">(</span><span class="s1">'married'</span><span class="p">)),</span>
<span class="s2">"partner"</span><span class="p">,</span> <span class="n">employment</span><span class="o">.</span><span class="n">column</span><span class="p">(</span><span class="s1">'partner'</span><span class="p">)</span><span class="o">/</span><span class="nb">sum</span><span class="p">(</span><span class="n">employment</span><span class="o">.</span><span class="n">column</span><span class="p">(</span><span class="s1">'partner'</span><span class="p">))</span>
<span class="p">])</span>
<span class="n">proportions</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
</pre></div></div></div>
<div class="output_html rendered_html output_subarea ">
<table border="1" class="dataframe">
<thead>
<tr>
<th>Employment Status</th> <th>married</th> <th>partner</th>
</tr>
</thead>
<tbody>
<tr>
<td>not working - disabled </td> <td>0.0592992</td> <td>0.0687285</td>
</tr>
</tbody>
<tbody><tr>
<td>not working - looking for work </td> <td>0.0377358</td> <td>0.113402 </td>
</tr>
</tbody>
<tbody><tr>
<td>not working - on a temporary layoff from a job</td> <td>0.0202156</td> <td>0.0274914</td>
</tr>
</tbody>
<tbody><tr>
<td>not working - other </td> <td>0.0215633</td> <td>0.0309278</td>
</tr>
</tbody>
<tbody><tr>
<td>not working - retired </td> <td>0.0592992</td> <td>0.0137457</td>
</tr>
</tbody>
<tbody><tr>
<td>working as paid employee </td> <td>0.691375 </td> <td>0.584192 </td>
</tr>
</tbody>
<tbody><tr>
<td>working, self-employed </td> <td>0.110512 </td> <td>0.161512 </td>
</tr>
</tbody>
</table></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>The <code>married</code> column of this table shows the distribution of employment status of the married men in the sample. For example, among married men, the proportion who are retired is about 0.059. The <code>partner</code> column shows the distribution of the employment status of the unmarried men in the sample. Among unmarried men, the proportion who are retired is about 0.014.</p>
<p>The two distributions look different from each other in other ways too, as can be seen more clearly in the bar graphs below. It appears that a larger proportion of the married men in the sample work as paid employees, whereas a larger proportion of the unmarried men are not working but are looking for work.</p></div></div>
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">proportions</span><span class="o">.</span><span class="n">barh</span><span class="p">(</span><span class="s1">'Employment Status'</span><span class="p">)</span>
</pre></div></div></div>
<div class="output_png output_subarea ">
<img src="../notebooks-images/Permutation_22_0.png"/></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>The distributions of employment status of the men in the two groups – married and unmarried – are clearly different in the sample.</p></div></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h3 id="Are-the-two-distributions-different-in-the-population?">Are the two distributions different in the population?<a class="anchor-link" href="#Are-the-two-distributions-different-in-the-population?">¶</a></h3><p>This raises the question of whether the difference is due to randomness in the sampling, or whether the distributions of employment status are indeed different for married and unmarried cohabiting men in the U.S. Remember that the data that we have are from a sample of just 1,033 couples; we do not know the distribution of employment status of married or unmarried cohabiting men in the entire country.</p></div></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>We can answer the question by performing a statistical test of hypotheses. Let us use the terminology that we developed for this in the previous section.</p></div></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p><strong>Null hypothesis.</strong> In the United States, the distribution of employment status among married men is the same as among unmarried men who live with their partners.</p>
<p>Another way of saying this is that employment status and marital status are <em>independent</em> or <em>not associated</em>.</p>
<p>If the null hypothesis were true, then the difference that we have observed in the sample would be just due to chance.</p>
<p><strong>Alternative hypothesis.</strong> In the United States, the distributions of the employment status of the two groups of men are different. In other words, employment status and marital status are associated in some way.</p></div></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
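<p>As our <strong>test statistic</strong>, we will use the total variation distance between the two distributions of employment status in the sample, that is, half the sum of the absolute differences between corresponding proportions.</p>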
<p>The observed value of the test statistic is about 0.15:</p></div></div>
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="c1"># TVD between the two distributions in the sample</span>
<span class="n">married</span> <span class="o">=</span> <span class="n">proportions</span><span class="o">.</span><span class="n">column</span><span class="p">(</span><span class="s1">'married'</span><span class="p">)</span>
<span class="n">partner</span> <span class="o">=</span> <span class="n">proportions</span><span class="o">.</span><span class="n">column</span><span class="p">(</span><span class="s1">'partner'</span><span class="p">)</span>
<span class="n">observed_tvd</span> <span class="o">=</span> <span class="mf">0.5</span><span class="o">*</span><span class="nb">sum</span><span class="p">(</span><span class="nb">abs</span><span class="p">(</span><span class="n">married</span> <span class="o">-</span> <span class="n">partner</span><span class="p">))</span>
<span class="n">observed_tvd</span>
</pre></div></div></div>
<div class="output_text output_subarea output_execute_result">
<pre>0.15273571011754242</pre></div>
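<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Since the same statistic will be needed for every simulated sample, it can help to wrap the calculation in a function. The sketch below uses plain Python lists rather than table columns. The function names <code>to_proportions</code> and <code>total_variation_distance</code> are our own choices for illustration; the counts are copied from the contingency table above, so the result should agree with the observed value of about 0.15.</p></div></div>
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre>
```python
def to_proportions(counts):
    """Convert a list of counts into proportions that sum to 1."""
    total = sum(counts)
    return [count / total for count in counts]


def total_variation_distance(dist_a, dist_b):
    """Half the sum of absolute differences between two distributions.

    Both arguments are lists of proportions over the same categories,
    listed in the same order.
    """
    if len(dist_a) != len(dist_b):
        raise ValueError("distributions must have the same categories")
    return 0.5 * sum(abs(a - b) for a, b in zip(dist_a, dist_b))


# Counts copied from the contingency table above, in row order.
married_counts = [44, 28, 15, 16, 44, 513, 82]
partner_counts = [20, 33, 8, 9, 4, 170, 47]

tvd = total_variation_distance(
    to_proportions(married_counts),
    to_proportions(partner_counts),
)
print(tvd)
```
</pre></div></div></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>The distance is 0 when the two distributions are identical and grows as they differ more, which is why large values are evidence against the null hypothesis.</p></div></div>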
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h3 id="Random-Permutations">Random Permutations<a class="anchor-link" href="#Random-Permutations">¶</a></h3><p>In order to compare this observed value of the total variation distance with what is predicted by the null hypothesis, we need to know how the total variation distance would vary across all possible random samples if employment status and marital status were not related.</p>
<p>This is quite daunting to derive by mathematics, but let us see if we can get a good approximation by simulation.</p>
<p>With just one sample at hand, and no further knowledge of the distribution of employment status among men in the United States, how can we go about replicating the sampling procedure? The key is to note that <em>if</em> marital status and employment status were not connected in any way, then we could replicate the sampling process by replacing each man's employment status by a randomly picked employment status from among all the men, married and unmarried.</p>
<p>Doing this for all the men is equivalent to randomly rearranging the entire column containing employment status, while leaving the marital status column unchanged. Such a rearrangement is called a <em>random permutation</em>.</p></div></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Thus, under the null hypothesis, we can replicate the sampling process by assigning to each man an employment status chosen at random without replacement from the entries in the column <code>Employment Status</code>. We can do the replication by simply permuting the entire <code>Employment Status</code> column and leaving everything else unchanged.</p></div></div>
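<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Before doing this with the table, here is a small sketch of the idea in plain Python, using the standard <code>random</code> module. The six men and their labels are hypothetical, and the seed is fixed only so that the sketch is reproducible. The point is that a permutation draws without replacement: every original employment status is used exactly once, so the overall counts do not change, and only the pairing with marital status does.</p></div></div>
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre>
```python
import random
from collections import Counter

# Hypothetical toy data: one entry per man, in matching order.
marital = ["married", "married", "married", "married", "partner", "partner"]
employment = ["working", "working", "retired", "working", "looking", "working"]

# Seeded generator, used here only to make the sketch reproducible.
rng = random.Random(0)

# Sampling all the entries without replacement is a random permutation.
shuffled_employment = rng.sample(employment, k=len(employment))

# The marital status column is left unchanged; only the pairing changes.
shuffled_pairs = list(zip(marital, shuffled_employment))

# The overall distribution of employment status is preserved.
print(Counter(employment) == Counter(shuffled_employment))
for pair in shuffled_pairs:
    print(pair)
```
</pre></div></div></div>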
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Let's implement this plan. First, we will shuffle the column <code>Employment Status</code> using the <code>sample</code> method, which just shuffles all the rows when provided with no arguments.</p></div></div>
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="c1"># Randomly permute the employment status of all men</span>
<span class="n">shuffled</span> <span class="o">=</span> <span class="n">males</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s1">'Employment Status'</span><span class="p">)</span><span class="o">.</span><span class="n">sample</span><span class="p">()</span>
<span class="n">shuffled</span>
</pre></div></div></div>
<div class="output_html rendered_html output_subarea output_execute_result">
<table border="1" class="dataframe">
<thead>
<tr>
<th>Employment Status</th>
</tr>
</thead>
<tbody>
<tr>
<td>working, self-employed </td>
</tr>
</tbody>
<tbody><tr>
<td>working, self-employed </td>
</tr>
</tbody>
<tbody><tr>
<td>working as paid employee</td>
</tr>
</tbody>
<tbody><tr>
<td>working as paid employee</td>
</tr>
</tbody>
<tbody><tr>
<td>working, self-employed </td>
</tr>
</tbody>
<tbody><tr>
<td>working as paid employee</td>
</tr>
</tbody>
<tbody><tr>
<td>not working - disabled </td>
</tr>
</tbody>
<tbody><tr>
<td>working as paid employee</td>
</tr>
</tbody>
<tbody><tr>
<td>working as paid employee</td>
</tr>
</tbody>
<tbody><tr>
<td>working as paid employee</td>
</tr>
</tbody>
</table>
<p>... (1023 rows omitted)</p></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>The first two columns of the table below are taken from the original sample. The third has been created by randomly permuting the original <code>Employment Status</code> column.</p></div></div>
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="c1"># Construct a table in which employment status has been shuffled</span>
<span class="n">males_with_shuffled_empl</span> <span class="o">=</span> <span class="n">Table</span><span class="p">()</span><span class="o">.</span><span class="n">with_columns</span><span class="p">([</span>
<span class="s2">"Marital Status"</span><span class="p">,</span> <span class="n">males</span><span class="o">.</span><span class="n">column</span><span class="p">(</span><span class="s1">'Marital Status'</span><span class="p">),</span>
<span class="s2">"Employment Status"</span><span class="p">,</span> <span class="n">males</span><span class="o">.</span><span class="n">column</span><span class="p">(</span><span class="s1">'Employment Status'</span><span class="p">),</span>
<span class="s2">"Employment Status (shuffled)"</span><span class="p">,</span> <span class="n">shuffled</span><span class="o">.</span><span class="n">column</span><span class="p">(</span><span class="s1">'Employment Status'</span><span class="p">)</span>
<span class="p">])</span>
<span class="n">males_with_shuffled_empl</span>
</pre></div></div></div>
<div class="output_html rendered_html output_subarea output_execute_result">
<table border="1" class="dataframe">
<thead>
<tr>
<th>Marital Status</th> <th>Employment Status</th> <th>Employment Status (shuffled)</th>
</tr>
</thead>
<tbody>
<tr>
<td>married </td> <td>working as paid employee </td> <td>working, self-employed </td>
</tr>
</tbody>
<tbody><tr>
<td>married </td> <td>working as paid employee </td> <td>working, self-employed </td>
</tr>
</tbody>
<tbody><tr>
<td>married </td> <td>working as paid employee </td> <td>working as paid employee </td>
</tr>
</tbody>
<tbody><tr>
<td>married </td> <td>working, self-employed </td> <td>working as paid employee </td>
</tr>
</tbody>
<tbody><tr>
<td>married </td> <td>not working - other </td> <td>working, self-employed </td>
</tr>
</tbody>
<tbody><tr>
<td>married </td> <td>not working - on a temporary layoff from a job</td> <td>working as paid employee </td>
</tr>
</tbody>
<tbody><tr>
<td>married </td> <td>not working - disabled </td> <td>not working - disabled </td>
</tr>
</tbody>
<tbody><tr>
<td>married </td> <td>working as paid employee </td> <td>working as paid employee </td>
</tr>
</tbody>
<tbody><tr>
<td>married </td> <td>working as paid employee </td> <td>working as paid employee </td>
</tr>
</tbody>
<tbody><tr>
<td>married </td> <td>not working - retired </td> <td>working as paid employee </td>
</tr>
</tbody>
</table>
<p>... (1023 rows omitted)</p></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Once again, the <code>pivot</code> method computes the contingency table, which allows us to calculate the total variation distance between the distributions of the two groups of men after their employment status has been shuffled.</p></div></div>
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">employment_shuffled</span> <span class="o">=</span> <span class="n">males_with_shuffled_empl</span><span class="o">.</span><span class="n">pivot</span><span class="p">(</span><span class="s1">'Marital Status'</span><span class="p">,</span> <span class="s1">'Employment Status (shuffled)'</span><span class="p">)</span>
<span class="n">employment_shuffled</span>
</pre></div></div></div>
<div class="output_html rendered_html output_subarea output_execute_result">
<table border="1" class="dataframe">
<thead>
<tr>
<th>Employment Status (shuffled)</th> <th>married</th> <th>partner</th>
</tr>
</thead>
<tbody>
<tr>
<td>not working - disabled </td> <td>48 </td> <td>16 </td>
</tr>
</tbody>
<tbody><tr>
<td>not working - looking for work </td> <td>44 </td> <td>17 </td>
</tr>
</tbody>
<tbody><tr>
<td>not working - on a temporary layoff from a job</td> <td>16 </td> <td>7 </td>
</tr>
</tbody>
<tbody><tr>
<td>not working - other </td> <td>18 </td> <td>7 </td>
</tr>
</tbody>
<tbody><tr>
<td>not working - retired </td> <td>39 </td> <td>9 </td>
</tr>
</tbody>
<tbody><tr>
<td>working as paid employee </td> <td>489 </td> <td>194 </td>
</tr>
</tbody>
<tbody><tr>
<td>working, self-employed </td> <td>88 </td> <td>41 </td>
</tr>
</tbody>
</table></div>
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="c1"># TVD between the two distributions in the contingency table above</span>
<span class="n">e_s</span> <span class="o">=</span> <span class="n">employment_shuffled</span>
<span class="n">married</span> <span class="o">=</span> <span class="n">e_s</span><span class="o">.</span><span class="n">column</span><span class="p">(</span><span class="s1">'married'</span><span class="p">)</span><span class="o">/</span><span class="nb">sum</span><span class="p">(</span><span class="n">e_s</span><span class="o">.</span><span class="n">column</span><span class="p">(</span><span class="s1">'married'</span><span class="p">))</span>
<span class="n">partner</span> <span class="o">=</span> <span class="n">e_s</span><span class="o">.</span><span class="n">column</span><span class="p">(</span><span class="s1">'partner'</span><span class="p">)</span><span class="o">/</span><span class="nb">sum</span><span class="p">(</span><span class="n">e_s</span><span class="o">.</span><span class="n">column</span><span class="p">(</span><span class="s1">'partner'</span><span class="p">))</span>
<span class="mf">0.5</span><span class="o">*</span><span class="nb">sum</span><span class="p">(</span><span class="nb">abs</span><span class="p">(</span><span class="n">married</span> <span class="o">-</span> <span class="n">partner</span><span class="p">))</span>
</pre></div></div></div>
<div class="output_text output_subarea output_execute_result">
<pre>0.032423745611841297</pre></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>This total variation distance was computed based on the null hypothesis that the distributions of employment status for the two groups of men are the same. You can see that it is noticeably smaller than the observed value of the total variation distance (0.15) between the two groups in our original sample.</p></div></div>
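<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>The arithmetic behind this number can be checked by hand. The sketch below uses plain Python lists instead of the <code>Table</code> methods; the counts are copied from the shuffled contingency table above, and the helper name <code>proportions</code> is introduced here only for illustration. Each column of counts is converted to proportions, and the total variation distance is half the sum of the absolute differences. The result should agree with the value printed above, up to rounding.</p></div></div>

```python
# Counts copied from the shuffled contingency table above
married_counts = [48, 44, 16, 18, 39, 489, 88]
partner_counts = [16, 17, 7, 7, 9, 194, 41]

def proportions(counts):
    """Convert a list of counts into proportions that sum to 1."""
    total = sum(counts)
    return [c / total for c in counts]

married_props = proportions(married_counts)
partner_props = proportions(partner_counts)

# Half the sum of absolute differences between the two distributions
shuffled_tvd = 0.5 * sum(abs(m - p) for m, p in zip(married_props, partner_props))
print(shuffled_tvd)
```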
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h3 id="A-Permutation-Test">A Permutation Test<a class="anchor-link" href="#A-Permutation-Test">¶</a></h3><p>Could this just be due to chance variation? We will only know if we run many more replications, by randomly permuting the <code>Employment Status</code> column repeatedly. This method of testing is known as a <strong>permutation test</strong>.</p></div></div>
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="c1"># Put it all together in a for loop to perform a permutation test</span>
<span class="n">repetitions</span> <span class="o">=</span> <span class="mi">500</span>
<span class="n">tvds</span> <span class="o">=</span> <span class="n">Table</span><span class="p">()</span><span class="o">.</span><span class="n">with_column</span><span class="p">(</span><span class="s2">"TVD between married and partnered men"</span><span class="p">,</span> <span class="p">[])</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">np</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="n">repetitions</span><span class="p">):</span>
    <span class="c1"># Construct a permuted table</span>
    <span class="n">shuffled</span> <span class="o">=</span> <span class="n">males</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s1">'Employment Status'</span><span class="p">)</span><span class="o">.</span><span class="n">sample</span><span class="p">()</span>
    <span class="n">combined</span> <span class="o">=</span> <span class="n">Table</span><span class="p">()</span><span class="o">.</span><span class="n">with_columns</span><span class="p">([</span>
        <span class="s2">"Marital Status"</span><span class="p">,</span> <span class="n">males</span><span class="o">.</span><span class="n">column</span><span class="p">(</span><span class="s1">'Marital Status'</span><span class="p">),</span>
        <span class="s2">"Employment Status"</span><span class="p">,</span> <span class="n">shuffled</span><span class="o">.</span><span class="n">column</span><span class="p">(</span><span class="s1">'Employment Status'</span><span class="p">)</span>
    <span class="p">])</span>
    <span class="n">employment_shuffled</span> <span class="o">=</span> <span class="n">combined</span><span class="o">.</span><span class="n">pivot</span><span class="p">(</span><span class="s1">'Marital Status'</span><span class="p">,</span> <span class="s1">'Employment Status'</span><span class="p">)</span>
    <span class="c1"># Compute TVD</span>
    <span class="n">e_s</span> <span class="o">=</span> <span class="n">employment_shuffled</span>
    <span class="n">married</span> <span class="o">=</span> <span class="n">e_s</span><span class="o">.</span><span class="n">column</span><span class="p">(</span><span class="s1">'married'</span><span class="p">)</span><span class="o">/</span><span class="nb">sum</span><span class="p">(</span><span class="n">e_s</span><span class="o">.</span><span class="n">column</span><span class="p">(</span><span class="s1">'married'</span><span class="p">))</span>
    <span class="n">partner</span> <span class="o">=</span> <span class="n">e_s</span><span class="o">.</span><span class="n">column</span><span class="p">(</span><span class="s1">'partner'</span><span class="p">)</span><span class="o">/</span><span class="nb">sum</span><span class="p">(</span><span class="n">e_s</span><span class="o">.</span><span class="n">column</span><span class="p">(</span><span class="s1">'partner'</span><span class="p">))</span>
    <span class="n">permutation_tvd</span> <span class="o">=</span> <span class="mf">0.5</span><span class="o">*</span><span class="nb">sum</span><span class="p">(</span><span class="nb">abs</span><span class="p">(</span><span class="n">married</span> <span class="o">-</span> <span class="n">partner</span><span class="p">))</span>
    <span class="n">tvds</span><span class="o">.</span><span class="n">append</span><span class="p">([</span><span class="n">permutation_tvd</span><span class="p">])</span>
<span class="n">tvds</span><span class="o">.</span><span class="n">hist</span><span class="p">(</span><span class="n">bins</span><span class="o">=</span><span class="n">np</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mf">0.2</span><span class="p">,</span> <span class="mf">0.01</span><span class="p">))</span>
</pre></div></div></div>
<div class="output_png output_subarea ">
<img src="../notebooks-images/Permutation_40_0.png"/></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>The figure above is the <strong>empirical distribution of the total variation distance</strong> between the distributions of the employment status of married and unmarried men, under the null hypothesis.</p>
<p><strong>The observed test statistic of 0.15 is quite far in the tail, and so the chance of observing such an extreme value under the null hypothesis is close to 0</strong>.</p>
<p>As before, this chance is called an empirical P-value. The P-value is the chance that our test statistic (TVD) would come out at least as extreme as the observed value (in this case 0.15 or greater) under the null hypothesis.</p></div></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h3 id="Conclusion-of-the-test">Conclusion of the test<a class="anchor-link" href="#Conclusion-of-the-test">¶</a></h3><p>Our empirical estimate based on repeated sampling gives us all the information we need for drawing conclusions from the data: the observed statistic is very unlikely under the null hypothesis.</p>
<p>The low P-value constitutes <strong>evidence in favor of the alternative hypothesis</strong>. The data support the hypothesis that in the United States, the distribution of the employment status of married men is not the same as that of unmarried men who live with their partners.</p>
<p>We have just completed our first <em>permutation test</em>. Permutation tests are quite common in practice because they make very few assumptions about the underlying population and are straightforward to perform and interpret.</p></div></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p><strong>Note about the approximate P-value</strong></p>
<p>Our simulation gives us an approximate empirical P-value, because it is based on just 500 random permutations instead of all the possible random permutations. We can compute this empirical P-value directly, without drawing the histogram:</p></div></div>
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">empirical_p_value</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">count_nonzero</span><span class="p">(</span><span class="n">tvds</span><span class="o">.</span><span class="n">column</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span> <span class="o">>=</span> <span class="n">observed_tvd</span><span class="p">)</span> <span class="o">/</span> <span class="n">tvds</span><span class="o">.</span><span class="n">num_rows</span>
<span class="n">empirical_p_value</span>
</pre></div></div></div>
<div class="output_text output_subarea output_execute_result">
<pre>0.0</pre></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Computing the exact P-value would require us to consider all possible outcomes of shuffling (a very large number) instead of just 500 random shuffles. Had we examined every possible shuffle, at least a few would have produced TVDs as extreme as the observed one; the original arrangement itself is one of them. The true P-value is therefore greater than zero, but not by much.</p></div></div>
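<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>For a data set small enough to enumerate, the exact P-value can be computed directly. The sketch below is an illustration under stated assumptions, not part of the analysis above: the six-row data set and the helper <code>tvd_for_split</code> are hypothetical. It relies on the fact that shuffling the values is equivalent to choosing which rows carry the first condition label, so every possible choice is examined once. Because the observed arrangement is itself one of the choices, the exact P-value can never be 0.</p></div></div>

```python
from itertools import combinations

# Hypothetical toy data: 6 rows, of which 3 are 'married' and 3 are 'partner'
values = ['working', 'working', 'working', 'retired', 'retired', 'working']
married_rows = (0, 1, 2)  # observed positions of the 'married' rows

def tvd_for_split(group_a_rows):
    """TVD between the value distributions of group A and of all other rows."""
    a = [values[i] for i in range(len(values)) if i in group_a_rows]
    b = [values[i] for i in range(len(values)) if i not in group_a_rows]
    categories = set(values)
    return 0.5 * sum(abs(a.count(c) / len(a) - b.count(c) / len(b))
                     for c in categories)

observed = tvd_for_split(married_rows)

# Every way of choosing which 3 of the 6 rows are labelled 'married'
all_splits = list(combinations(range(len(values)), len(married_rows)))

# A small tolerance guards against floating-point ties
count_extreme = sum(tvd_for_split(s) >= observed - 1e-12 for s in all_splits)
exact_p_value = count_extreme / len(all_splits)
print(len(all_splits), exact_p_value)
```

<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>With more than a thousand rows, as in the survey sample, the number of such choices is far too large to enumerate, which is why the simulation uses random shuffles instead.</p></div></div>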
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h3 id="Generalizing-Our-Hypothesis-Test">Generalizing Our Hypothesis Test<a class="anchor-link" href="#Generalizing-Our-Hypothesis-Test">¶</a></h3><p>The example above includes a substantial amount of code in order to investigate the relationship between two characteristics (marital status and employment status) for a particular subset of the surveyed population (males). Suppose we would like to investigate different characteristics or a different population. How can we reuse the code we have written so far in order to explore more relationships?</p>
<p>When you are about to copy your code, you should think, "Maybe I should write some functions."</p></div></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>What functions to write? A good way to make this decision is to think about what you have to compute repeatedly.</p>
<p>In our example, the total variation distance is computed over and over again. So we will begin with a generalized computation of total variation distance between the distribution of any column of values (such as employment status) when separated into any two conditions (such as marital status) for a collection of data described by any table. Our implementation includes the same statements as we used above, but uses generic names that are specified by the final function call.</p></div></div>
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="c1"># TVD between the distributions of values under any two conditions</span>
<span class="k">def</span> <span class="nf">tvd</span><span class="p">(</span><span class="n">t</span><span class="p">,</span> <span class="n">conditions</span><span class="p">,</span> <span class="n">values</span><span class="p">):</span>
    <span class="sd">"""Compute the total variation distance </span>
<span class="sd">    between proportions of values under two conditions.</span>
<span class="sd">    </span>
<span class="sd">    t (Table) -- a table</span>
<span class="sd">    conditions (str) -- a column label in t; should have only two categories</span>
<span class="sd">    values (str) -- a column label in t</span>
<span class="sd">    """</span>
    <span class="n">e</span> <span class="o">=</span> <span class="n">t</span><span class="o">.</span><span class="n">pivot</span><span class="p">(</span><span class="n">conditions</span><span class="p">,</span> <span class="n">values</span><span class="p">)</span>
    <span class="n">a</span> <span class="o">=</span> <span class="n">e</span><span class="o">.</span><span class="n">column</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>
    <span class="n">b</span> <span class="o">=</span> <span class="n">e</span><span class="o">.</span><span class="n">column</span><span class="p">(</span><span class="mi">2</span><span class="p">)</span>
    <span class="k">return</span> <span class="mf">0.5</span><span class="o">*</span><span class="nb">sum</span><span class="p">(</span><span class="nb">abs</span><span class="p">(</span><span class="n">a</span><span class="o">/</span><span class="nb">sum</span><span class="p">(</span><span class="n">a</span><span class="p">)</span> <span class="o">-</span> <span class="n">b</span><span class="o">/</span><span class="nb">sum</span><span class="p">(</span><span class="n">b</span><span class="p">)))</span>
<span class="n">tvd</span><span class="p">(</span><span class="n">males</span><span class="p">,</span> <span class="s1">'Marital Status'</span><span class="p">,</span> <span class="s1">'Employment Status'</span><span class="p">)</span>
</pre></div></div></div>
<div class="output_text output_subarea output_execute_result">
<pre>0.15273571011754242</pre></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Next, we can write a function that performs a permutation test using this <code>tvd</code> function to compute the same statistic on shuffled variants of any table. It's worth reading through this implementation to understand its details.</p></div></div>
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="k">def</span> <span class="nf">permutation_tvd</span><span class="p">(</span><span class="n">original</span><span class="p">,</span> <span class="n">conditions</span><span class="p">,</span> <span class="n">values</span><span class="p">):</span>
    <span class="sd">"""</span>
<span class="sd">    Perform a permutation test of whether </span>
<span class="sd">    the distribution of values for two conditions </span>
<span class="sd">    is the same in the population,</span>
<span class="sd">    using the total variation distance between two distributions</span>
<span class="sd">    as the test statistic.</span>
<span class="sd">    </span>
<span class="sd">    original is a Table with two columns. The value of the argument</span>
<span class="sd">    conditions is the name of one column, and the value of the argument</span>
<span class="sd">    values is the name of the other column. The conditions column should</span>
<span class="sd">    have only 2 possible values corresponding to 2 categories in the</span>
<span class="sd">    data.</span>
<span class="sd">    </span>
<span class="sd">    The values column is shuffled many times, and the data are grouped</span>
<span class="sd">    according to the conditions column. The total variation distance</span>
<span class="sd">    between the proportions of values in the 2 categories is computed. </span>
<span class="sd">    </span>
<span class="sd">    Then we draw a histogram of all those TV distances. This shows us </span>
<span class="sd">    what the TVD between the values of the two distributions would typically</span>
<span class="sd">    look like if the values were independent of the conditions.</span>
<span class="sd">    """</span>
    <span class="n">repetitions</span> <span class="o">=</span> <span class="mi">500</span>
    <span class="n">stats</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">np</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="n">repetitions</span><span class="p">):</span>
        <span class="n">shuffled</span> <span class="o">=</span> <span class="n">original</span><span class="o">.</span><span class="n">sample</span><span class="p">()</span>
        <span class="n">combined</span> <span class="o">=</span> <span class="n">Table</span><span class="p">()</span><span class="o">.</span><span class="n">with_columns</span><span class="p">([</span>
            <span class="n">conditions</span><span class="p">,</span> <span class="n">original</span><span class="o">.</span><span class="n">column</span><span class="p">(</span><span class="n">conditions</span><span class="p">),</span>
            <span class="n">values</span><span class="p">,</span> <span class="n">shuffled</span><span class="o">.</span><span class="n">column</span><span class="p">(</span><span class="n">values</span><span class="p">)</span>
        <span class="p">])</span>
        <span class="n">stats</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">tvd</span><span class="p">(</span><span class="n">combined</span><span class="p">,</span> <span class="n">conditions</span><span class="p">,</span> <span class="n">values</span><span class="p">))</span>
    <span class="n">observation</span> <span class="o">=</span> <span class="n">tvd</span><span class="p">(</span><span class="n">original</span><span class="p">,</span> <span class="n">conditions</span><span class="p">,</span> <span class="n">values</span><span class="p">)</span>
    <span class="n">p_value</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">count_nonzero</span><span class="p">(</span><span class="n">stats</span> <span class="o">>=</span> <span class="n">observation</span><span class="p">)</span> <span class="o">/</span> <span class="n">repetitions</span>
    <span class="nb">print</span><span class="p">(</span><span class="s2">"Observation:"</span><span class="p">,</span> <span class="n">observation</span><span class="p">)</span>
    <span class="nb">print</span><span class="p">(</span><span class="s2">"Empirical P-value:"</span><span class="p">,</span> <span class="n">p_value</span><span class="p">)</span>
    <span class="n">Table</span><span class="p">([</span><span class="n">stats</span><span class="p">],</span> <span class="p">[</span><span class="s1">'Empirical distribution'</span><span class="p">])</span><span class="o">.</span><span class="n">hist</span><span class="p">()</span>
</pre></div></div></div>
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">permutation_tvd</span><span class="p">(</span><span class="n">males</span><span class="p">,</span> <span class="s1">'Marital Status'</span><span class="p">,</span> <span class="s1">'Employment Status'</span><span class="p">)</span>
</pre></div></div></div>
<div class="output_subarea output_stream output_stdout output_text">
<pre>Observation: 0.152735710118
Empirical P-value: 0.0
</pre></div>
<div class="output_png output_subarea ">
<img src="../notebooks-images/Permutation_51_1.png"/></div>
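<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>The function above prints its results and draws a histogram. When the results are needed for further computation, a variant that returns them can be more convenient. The following is a minimal sketch of that design in plain Python, not a replacement for <code>permutation_tvd</code>: the names <code>tvd_from_pairs</code> and <code>permutation_test</code>, the <code>seed</code> parameter, and the toy data are all assumptions made for illustration. The data are represented as a list of (condition, value) pairs instead of a <code>Table</code>.</p></div></div>

```python
import random
from collections import Counter

def tvd_from_pairs(pairs):
    """TVD between the value distributions of exactly two conditions.

    pairs is a list of (condition, value) tuples.
    """
    conditions = sorted({c for c, _ in pairs})
    if len(conditions) != 2:
        raise ValueError("expected exactly two conditions")
    first = Counter(v for c, v in pairs if c == conditions[0])
    second = Counter(v for c, v in pairs if c == conditions[1])
    n_first, n_second = sum(first.values()), sum(second.values())
    categories = set(first) | set(second)
    # A Counter returns 0 for a category that is absent from one group
    return 0.5 * sum(abs(first[v] / n_first - second[v] / n_second)
                     for v in categories)

def permutation_test(pairs, repetitions=500, seed=None):
    """Return (observed TVD, empirical P-value, list of simulated TVDs)."""
    rng = random.Random(seed)
    conditions = [c for c, _ in pairs]
    values = [v for _, v in pairs]
    observed = tvd_from_pairs(pairs)
    simulated = []
    for _ in range(repetitions):
        # Shuffle the values while leaving the conditions in place
        shuffled = rng.sample(values, k=len(values))
        simulated.append(tvd_from_pairs(list(zip(conditions, shuffled))))
    p_value = sum(s >= observed for s in simulated) / repetitions
    return observed, p_value, simulated

# Hypothetical data in which the two groups have completely different distributions
toy = [('married', 'working')] * 30 + [('partner', 'retired')] * 30
observed, p_value, simulated = permutation_test(toy, repetitions=200, seed=1)
print(observed, p_value)
```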
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Now that we have generalized our permutation test, we can apply it to other hypotheses. For example, we can compare the distribution over the employment status of women, grouping them by their marital status. In the case of men we found a difference, but what about with women? First, we can visualize the two distributions.</p></div></div>
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="k">def</span> <span class="nf">compare_bar</span><span class="p">(</span><span class="n">t</span><span class="p">,</span> <span class="n">conditions</span><span class="p">,</span> <span class="n">values</span><span class="p">):</span>
    <span class="sd">"""Bar graphs of distributions of values for each of two conditions."""</span>
    <span class="n">e</span> <span class="o">=</span> <span class="n">t</span><span class="o">.</span><span class="n">pivot</span><span class="p">(</span><span class="n">conditions</span><span class="p">,</span> <span class="n">values</span><span class="p">)</span>
    <span class="k">for</span> <span class="n">label</span> <span class="ow">in</span> <span class="n">e</span><span class="o">.</span><span class="n">drop</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span><span class="o">.</span><span class="n">labels</span><span class="p">:</span>
        <span class="c1"># Convert each column of counts into proportions</span>
        <span class="n">e</span><span class="o">.</span><span class="n">append_column</span><span class="p">(</span><span class="n">label</span><span class="p">,</span> <span class="n">e</span><span class="o">.</span><span class="n">column</span><span class="p">(</span><span class="n">label</span><span class="p">)</span><span class="o">/</span><span class="nb">sum</span><span class="p">(</span><span class="n">e</span><span class="o">.</span><span class="n">column</span><span class="p">(</span><span class="n">label</span><span class="p">)))</span>
    <span class="n">e</span><span class="o">.</span><span class="n">barh</span><span class="p">(</span><span class="n">values</span><span class="p">)</span>
<span class="n">compare_bar</span><span class="p">(</span><span class="n">couples</span><span class="o">.</span><span class="n">where</span><span class="p">(</span><span class="s1">'Gender'</span><span class="p">,</span> <span class="s1">'female'</span><span class="p">),</span> <span class="s1">'Marital Status'</span><span class="p">,</span> <span class="s1">'Employment Status'</span><span class="p">)</span>
</pre></div></div></div>
<div class="output_png output_subarea ">
<img src="../notebooks-images/Permutation_53_0.png"/></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>A glance at the figure shows that the two distributions are different in the sample. The difference in the category "Not working – other" is particularly striking: about 22% of the married women are in this category, compared to only about 8% of the unmarried women. There are several possible reasons for this difference. For example, the percent of homemakers is greater among married women than among unmarried women, possibly because married women are more likely to be "stay-at-home" mothers of young children. The difference could also be generational: as we saw earlier, the married couples are older than the unmarried partners, and older women are less likely to be in the workforce than younger women.</p></div></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>While we can see that the distributions are different in the sample, we are not really interested in the sample for its own sake. We are examining the sample because it is likely to reflect the population. So, as before, we will use the sample to try to answer a question about something unknown: the distributions of employment status of married and unmarried cohabiting women <em>in the United States</em>. That is the population from which the sample was drawn.</p>
<p>We have to consider the possibility that the observed difference in the sample could simply be the result of chance variation. Remember that our data are only from a random sample of couples. We do not have data for all the couples in the United States.</p></div></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p><strong>Null hypothesis</strong>: In the U.S., the distribution of employment status is the same for married women as for unmarried women living with their partners. The difference in the sample is due to chance.</p>
<p><strong>Alternative hypothesis</strong>: In the U.S., the distributions of employment status among married and unmarried cohabiting women are different.</p></div></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h3 id="Another-permutation-test-to-compare-distributions">Another permutation test to compare distributions<a class="anchor-link" href="#Another-permutation-test-to-compare-distributions">¶</a></h3><p>We can test these hypotheses just as we did for men, by using the function <code>permutation_tvd</code> that we defined for this purpose.</p></div></div>
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">permutation_tvd</span><span class="p">(</span><span class="n">couples</span><span class="o">.</span><span class="n">where</span><span class="p">(</span><span class="s1">'Gender'</span><span class="p">,</span> <span class="s1">'female'</span><span class="p">),</span> <span class="s1">'Marital Status'</span><span class="p">,</span> <span class="s1">'Employment Status'</span><span class="p">)</span>
</pre></div></div></div>
<div class="output_subarea output_stream output_stdout output_text">
<pre>Observation: 0.194755513565
Empirical P-value: 0.0
</pre></div>
<div class="output_png output_subarea ">
<img src="../notebooks-images/Permutation_58_1.png"/></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>As for the males, the empirical P-value is 0 based on a large number of repetitions. So the exact P-value is close to 0, which is evidence in favor of the alternative hypothesis. The data support the hypothesis that for women in the United States, employment status is associated with whether they are married or unmarried and living with their partners.</p></div></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h3 id="Another-example">Another example<a class="anchor-link" href="#Another-example">¶</a></h3><p>Are gender and employment status independent in the population? We are now in a position to test this quite swiftly:</p>
<p><strong>Null hypothesis.</strong> Among married and unmarried cohabiting individuals in the United States, gender is independent of employment status.</p>
<p><strong>Alternative hypothesis.</strong> Among married and unmarried cohabiting people in the United States, gender and employment status are related.</p></div></div>
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">permutation_tvd</span><span class="p">(</span><span class="n">couples</span><span class="p">,</span> <span class="s1">'Gender'</span><span class="p">,</span> <span class="s1">'Employment Status'</span><span class="p">)</span>
</pre></div></div></div>
<div class="output_subarea output_stream output_stdout output_text">
<pre>Observation: 0.185866408519
Empirical P-value: 0.0
</pre></div>
<div class="output_png output_subarea ">
<img src="../notebooks-images/Permutation_61_1.png"/></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>The conclusion of the test is that gender and employment status are not independent in the population. This is no surprise; for example, because of societal norms, older women were less likely to have gone into the workforce than men.</p></div></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h3 id="Deflategate:-Permutation-Tests-and-Quantitative-Variables">Deflategate: Permutation Tests and Quantitative Variables<a class="anchor-link" href="#Deflategate:-Permutation-Tests-and-Quantitative-Variables">¶</a></h3><p>On January 18, 2015, the Indianapolis Colts and the New England Patriots played the American Football Conference (AFC) championship game to determine which of those teams would play in the Super Bowl. After the game, there were allegations that the Patriots' footballs had not been inflated as much as the regulations required; they were softer. This could be an advantage, as softer balls might be easier to catch.</p>
<p>For several weeks, the world of American football was consumed by accusations, denials, theories, and suspicions: the press labeled the topic Deflategate, after the Watergate political scandal of the 1970s. The National Football League (NFL) commissioned an independent analysis. In this example, we will perform our own analysis of the data.</p>
<p>Pressure is often measured in pounds per square inch (psi). NFL rules stipulate that game balls must be inflated to have pressures in the range 12.5 psi to 13.5 psi. Each team plays with 12 balls. Teams have the responsibility of maintaining the pressure in their own footballs, but game officials inspect the balls. Before the start of the AFC game, all the Patriots' balls were at about 12.5 psi. Most of the Colts' balls were at about 13.0 psi. However, these pre-game data were not recorded.</p>
<p>During the second quarter, the Colts intercepted a Patriots ball. On the sidelines, they measured the pressure of the ball and determined that it was below the 12.5 psi threshold. Promptly, they informed officials.</p>
<p>At half-time, all the game balls were collected for inspection. Two officials, Clete Blakeman and Dyrol Prioleau, measured the pressure in each of the balls. Here are the data; pressure is measured in psi. The Patriots ball that had been intercepted by the Colts was not inspected at half-time. Nor were most of the Colts' balls – the officials simply ran out of time and had to relinquish the balls for the start of second half play.</p></div></div>
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">football</span> <span class="o">=</span> <span class="n">Table</span><span class="o">.</span><span class="n">read_table</span><span class="p">(</span><span class="s1">'football.csv'</span><span class="p">)</span>
<span class="n">football</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
</pre></div></div></div>
<div class="output_html rendered_html output_subarea ">
<table border="1" class="dataframe">
<thead>
<tr>
<th>Team</th> <th>Ball</th> <th>Blakeman</th> <th>Prioleau</th>
</tr>
</thead>
<tbody>
<tr>
<td>0 </td> <td>Patriots 1 </td> <td>11.5 </td> <td>11.8 </td>
</tr>
</tbody>
<tbody><tr>
<td>0 </td> <td>Patriots 2 </td> <td>10.85 </td> <td>11.2 </td>
</tr>
</tbody>
<tbody><tr>
<td>0 </td> <td>Patriots 3 </td> <td>11.15 </td> <td>11.5 </td>
</tr>
</tbody>
<tbody><tr>
<td>0 </td> <td>Patriots 4 </td> <td>10.7 </td> <td>11 </td>
</tr>
</tbody>
<tbody><tr>
<td>0 </td> <td>Patriots 5 </td> <td>11.1 </td> <td>11.45 </td>
</tr>
</tbody>
<tbody><tr>
<td>0 </td> <td>Patriots 6 </td> <td>11.6 </td> <td>11.95 </td>
</tr>
</tbody>
<tbody><tr>
<td>0 </td> <td>Patriots 7 </td> <td>11.85 </td> <td>12.3 </td>
</tr>
</tbody>
<tbody><tr>
<td>0 </td> <td>Patriots 8 </td> <td>11.1 </td> <td>11.55 </td>
</tr>
</tbody>
<tbody><tr>
<td>0 </td> <td>Patriots 9 </td> <td>10.95 </td> <td>11.35 </td>
</tr>
</tbody>
<tbody><tr>
<td>0 </td> <td>Patriots 10</td> <td>10.5 </td> <td>10.9 </td>
</tr>
</tbody>
<tbody><tr>
<td>0 </td> <td>Patriots 11</td> <td>10.9 </td> <td>11.35 </td>
</tr>
</tbody>
<tbody><tr>
<td>1 </td> <td>Colts 1 </td> <td>12.7 </td> <td>12.35 </td>
</tr>
</tbody>
<tbody><tr>
<td>1 </td> <td>Colts 2 </td> <td>12.75 </td> <td>12.3 </td>
</tr>
</tbody>
<tbody><tr>
<td>1 </td> <td>Colts 3 </td> <td>12.5 </td> <td>12.95 </td>
</tr>
</tbody>
<tbody><tr>
<td>1 </td> <td>Colts 4 </td> <td>12.55 </td> <td>12.15 </td>
</tr>
</tbody>
</table></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>For each of the 15 balls that were inspected, the two officials got different results. It is not uncommon for repeated measurements on the same object to yield different results, especially when the measurements are performed by different people. So we will assign to each ball the average of the two measurements made on that ball.</p></div></div>
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">football</span> <span class="o">=</span> <span class="n">football</span><span class="o">.</span><span class="n">with_column</span><span class="p">(</span>
<span class="s1">'Combined'</span><span class="p">,</span> <span class="p">(</span><span class="n">football</span><span class="o">.</span><span class="n">column</span><span class="p">(</span><span class="s1">'Blakeman'</span><span class="p">)</span><span class="o">+</span><span class="n">football</span><span class="o">.</span><span class="n">column</span><span class="p">(</span><span class="s1">'Prioleau'</span><span class="p">))</span><span class="o">/</span><span class="mi">2</span>
<span class="p">)</span>
<span class="n">football</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
</pre></div></div></div>
<div class="output_html rendered_html output_subarea ">
<table border="1" class="dataframe">
<thead>
<tr>
<th>Team</th> <th>Ball</th> <th>Blakeman</th> <th>Prioleau</th> <th>Combined</th>
</tr>
</thead>
<tbody>
<tr>
<td>0 </td> <td>Patriots 1 </td> <td>11.5 </td> <td>11.8 </td> <td>11.65 </td>
</tr>
</tbody>
<tbody><tr>
<td>0 </td> <td>Patriots 2 </td> <td>10.85 </td> <td>11.2 </td> <td>11.025 </td>
</tr>
</tbody>
<tbody><tr>
<td>0 </td> <td>Patriots 3 </td> <td>11.15 </td> <td>11.5 </td> <td>11.325 </td>
</tr>
</tbody>
<tbody><tr>
<td>0 </td> <td>Patriots 4 </td> <td>10.7 </td> <td>11 </td> <td>10.85 </td>
</tr>
</tbody>
<tbody><tr>
<td>0 </td> <td>Patriots 5 </td> <td>11.1 </td> <td>11.45 </td> <td>11.275 </td>
</tr>
</tbody>
<tbody><tr>
<td>0 </td> <td>Patriots 6 </td> <td>11.6 </td> <td>11.95 </td> <td>11.775 </td>
</tr>
</tbody>
<tbody><tr>
<td>0 </td> <td>Patriots 7 </td> <td>11.85 </td> <td>12.3 </td> <td>12.075 </td>
</tr>
</tbody>
<tbody><tr>
<td>0 </td> <td>Patriots 8 </td> <td>11.1 </td> <td>11.55 </td> <td>11.325 </td>
</tr>
</tbody>
<tbody><tr>
<td>0 </td> <td>Patriots 9 </td> <td>10.95 </td> <td>11.35 </td> <td>11.15 </td>
</tr>
</tbody>
<tbody><tr>
<td>0 </td> <td>Patriots 10</td> <td>10.5 </td> <td>10.9 </td> <td>10.7 </td>
</tr>
</tbody>
<tbody><tr>
<td>0 </td> <td>Patriots 11</td> <td>10.9 </td> <td>11.35 </td> <td>11.125 </td>
</tr>
</tbody>
<tbody><tr>
<td>1 </td> <td>Colts 1 </td> <td>12.7 </td> <td>12.35 </td> <td>12.525 </td>
</tr>
</tbody>
<tbody><tr>
<td>1 </td> <td>Colts 2 </td> <td>12.75 </td> <td>12.3 </td> <td>12.525 </td>
</tr>
</tbody>
<tbody><tr>
<td>1 </td> <td>Colts 3 </td> <td>12.5 </td> <td>12.95 </td> <td>12.725 </td>
</tr>
</tbody>
<tbody><tr>
<td>1 </td> <td>Colts 4 </td> <td>12.55 </td> <td>12.15 </td> <td>12.35 </td>
</tr>
</tbody>
</table></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>At a glance, it seems apparent that the Patriots' footballs were at a lower pressure than the Colts' balls. Because some deflation is normal during the course of a game, the independent analysts decided to calculate the drop in pressure from the start of the game. Recall that the Patriots' balls had all started out at about 12.5 psi, and the Colts' balls at about 13.0 psi. Therefore the drop in pressure for the Patriots' balls was computed as 12.5 minus the pressure at half-time, and the drop in pressure for the Colts' balls was 13.0 minus the pressure at half-time.</p></div></div>
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">football</span> <span class="o">=</span> <span class="n">football</span><span class="o">.</span><span class="n">with_column</span><span class="p">(</span>
<span class="s1">'Drop'</span><span class="p">,</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">([</span><span class="mf">12.5</span><span class="p">]</span><span class="o">*</span><span class="mi">11</span> <span class="o">+</span> <span class="p">[</span><span class="mf">13.0</span><span class="p">]</span><span class="o">*</span><span class="mi">4</span><span class="p">)</span> <span class="o">-</span> <span class="n">football</span><span class="o">.</span><span class="n">column</span><span class="p">(</span><span class="s1">'Combined'</span><span class="p">)</span>
<span class="p">)</span>
<span class="n">football</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
</pre></div></div></div>
<div class="output_html rendered_html output_subarea ">
<table border="1" class="dataframe">
<thead>
<tr>
<th>Team</th> <th>Ball</th> <th>Blakeman</th> <th>Prioleau</th> <th>Combined</th> <th>Drop</th>
</tr>
</thead>
<tbody>
<tr>
<td>0 </td> <td>Patriots 1 </td> <td>11.5 </td> <td>11.8 </td> <td>11.65 </td> <td>0.85 </td>
</tr>
</tbody>
<tbody><tr>
<td>0 </td> <td>Patriots 2 </td> <td>10.85 </td> <td>11.2 </td> <td>11.025 </td> <td>1.475</td>
</tr>
</tbody>
<tbody><tr>
<td>0 </td> <td>Patriots 3 </td> <td>11.15 </td> <td>11.5 </td> <td>11.325 </td> <td>1.175</td>
</tr>
</tbody>
<tbody><tr>
<td>0 </td> <td>Patriots 4 </td> <td>10.7 </td> <td>11 </td> <td>10.85 </td> <td>1.65 </td>
</tr>
</tbody>
<tbody><tr>
<td>0 </td> <td>Patriots 5 </td> <td>11.1 </td> <td>11.45 </td> <td>11.275 </td> <td>1.225</td>
</tr>
</tbody>
<tbody><tr>
<td>0 </td> <td>Patriots 6 </td> <td>11.6 </td> <td>11.95 </td> <td>11.775 </td> <td>0.725</td>
</tr>
</tbody>
<tbody><tr>
<td>0 </td> <td>Patriots 7 </td> <td>11.85 </td> <td>12.3 </td> <td>12.075 </td> <td>0.425</td>
</tr>
</tbody>
<tbody><tr>
<td>0 </td> <td>Patriots 8 </td> <td>11.1 </td> <td>11.55 </td> <td>11.325 </td> <td>1.175</td>
</tr>
</tbody>
<tbody><tr>
<td>0 </td> <td>Patriots 9 </td> <td>10.95 </td> <td>11.35 </td> <td>11.15 </td> <td>1.35 </td>
</tr>
</tbody>
<tbody><tr>
<td>0 </td> <td>Patriots 10</td> <td>10.5 </td> <td>10.9 </td> <td>10.7 </td> <td>1.8 </td>
</tr>
</tbody>
<tbody><tr>
<td>0 </td> <td>Patriots 11</td> <td>10.9 </td> <td>11.35 </td> <td>11.125 </td> <td>1.375</td>
</tr>
</tbody>
<tbody><tr>
<td>1 </td> <td>Colts 1 </td> <td>12.7 </td> <td>12.35 </td> <td>12.525 </td> <td>0.475</td>
</tr>
</tbody>
<tbody><tr>
<td>1 </td> <td>Colts 2 </td> <td>12.75 </td> <td>12.3 </td> <td>12.525 </td> <td>0.475</td>
</tr>
</tbody>
<tbody><tr>
<td>1 </td> <td>Colts 3 </td> <td>12.5 </td> <td>12.95 </td> <td>12.725 </td> <td>0.275</td>
</tr>
</tbody>
<tbody><tr>
<td>1 </td> <td>Colts 4 </td> <td>12.55 </td> <td>12.15 </td> <td>12.35 </td> <td>0.65 </td>
</tr>
</tbody>
</table></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>It is apparent that the drop was larger, on average, for the Patriots' footballs. But could the difference be just due to chance?</p>
<p>To answer this, we must first examine how chance might enter the analysis. This is not a situation in which there is a random sample of data from a large population. It is also not clear how to create a justifiable abstract chance model, as the balls were all different, inflated by different people, and maintained under different conditions.</p>
<p>One way to introduce chance is to ask whether the drops in pressure of the 11 Patriots balls and the 4 Colts balls resemble a random permutation of the 15 drops. Then the 4 Colts drops would be a simple random sample of all 15 drops. This gives us a null hypothesis that we can test using random permutations.</p>
<p><strong>Null hypothesis.</strong> The drops in the pressures of the 4 Colts balls are like a random sample (without replacement) from all 15 drops.</p></div></div>
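<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>To make this null hypothesis concrete, here is a minimal sketch in plain NumPy rather than the <code>Table</code> methods used elsewhere in this section. The <code>drops</code> array is copied by hand from the <code>Drop</code> column above, and the names <code>rng</code>, <code>chosen</code>, <code>colts_like</code>, and <code>patriots_like</code> are illustrative assumptions, as is the seed. It draws one sample of 4 drops without replacement and counts how many such samples are possible.</p></div></div>

```python
import math
import numpy as np

# The 15 observed drops, copied from the 'Drop' column above
# (11 Patriots balls first, then 4 Colts balls).
drops = np.array([0.85, 1.475, 1.175, 1.65, 1.225, 0.725, 0.425, 1.175,
                  1.35, 1.8, 1.375, 0.475, 0.475, 0.275, 0.65])

# Assumption: a seeded generator, used only to make the sketch repeatable.
rng = np.random.default_rng(0)

# Choose 4 of the 15 positions at random, without replacement.
chosen = rng.choice(len(drops), size=4, replace=False)
colts_like = drops[chosen]                 # plays the role of the Colts' drops
patriots_like = np.delete(drops, chosen)   # the remaining 11 drops

# Number of distinct ways to choose which 4 of the 15 drops form the sample.
possible_samples = math.comb(len(drops), 4)

print(colts_like, possible_samples)
```

<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Under the null hypothesis, each of these 1,365 possible choices of 4 drops is equally likely to be the Colts' group.</p></div></div>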
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h4 id="A-new-test-statistic">A new test statistic<a class="anchor-link" href="#A-new-test-statistic">¶</a></h4><p>The data are quantitative, so we cannot compare the two distributions category by category using the total variation distance. If we try to bin the data in order to use the TVD, the choice of bins can have a noticeable effect on the statistic. So instead, we will work with a simple statistic based on means. We will just compare the average drops in the two groups.</p>
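<p>As an illustration of the binning problem, the sketch below computes a binned TVD for the same drops under two different sets of bin edges. It uses plain NumPy with the drops copied by hand from the table above; the helper <code>binned_tvd</code> and both sets of edges are assumptions made for this example, not part of the analysis.</p>

```python
import numpy as np

# Drops copied from the 'Drop' column above.
patriots = np.array([0.85, 1.475, 1.175, 1.65, 1.225, 0.725,
                     0.425, 1.175, 1.35, 1.8, 1.375])
colts = np.array([0.475, 0.475, 0.275, 0.65])

def binned_tvd(a, b, edges):
    """TVD between the binned proportions of two arrays.

    Hypothetical helper; assumes every value falls inside the given edges.
    """
    a_counts, _ = np.histogram(a, bins=edges)
    b_counts, _ = np.histogram(b, bins=edges)
    return 0.5 * np.sum(np.abs(a_counts / len(a) - b_counts / len(b)))

# Two arbitrary but reasonable-looking choices of bins.
tvd_half_psi_bins = binned_tvd(patriots, colts, [0, 0.5, 1.0, 1.5, 2.0])
tvd_wider_bins = binned_tvd(patriots, colts, [0, 0.7, 1.4, 2.1])

print(tvd_half_psi_bins, tvd_wider_bins)
```

<p>With these two choices the statistic is 8/11 (about 0.73) and 10/11 (about 0.91) for exactly the same data, so the result depends on an arbitrary decision about where the bins fall.</p>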
<p>The observed difference between the average drops in the two groups was about 0.7335 psi.</p></div></div>
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">patriots</span> <span class="o">=</span> <span class="n">football</span><span class="o">.</span><span class="n">where</span><span class="p">(</span><span class="s1">'Team'</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span><span class="o">.</span><span class="n">column</span><span class="p">(</span><span class="s1">'Drop'</span><span class="p">)</span>
<span class="n">colts</span> <span class="o">=</span> <span class="n">football</span><span class="o">.</span><span class="n">where</span><span class="p">(</span><span class="s1">'Team'</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span><span class="o">.</span><span class="n">column</span><span class="p">(</span><span class="s1">'Drop'</span><span class="p">)</span>
<span class="n">observed_difference</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="n">patriots</span><span class="p">)</span> <span class="o">-</span> <span class="n">np</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="n">colts</span><span class="p">)</span>
<span class="n">observed_difference</span>
</pre></div></div></div>
<div class="output_text output_subarea output_execute_result">
<pre>0.73352272727272805</pre></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Now the question becomes: If we took a random permutation of the 15 drops, how likely is it that the difference in the means of the first 11 and the last 4 would be at least as large as the difference observed by the officials?</p></div></div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>To answer this, we will randomly permute the 15 drops and assign the first 11 permuted values to the Patriots and the last 4 to the Colts. Then we will find the difference in the means of the two permuted groups.</p></div></div>
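<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>The procedure just described can be sketched in plain NumPy as follows. This is one possible way to carry it out, not necessarily the code used in the rest of this section: the <code>drops</code> array is copied by hand from the table above, and the helper <code>difference_of_means</code>, the seed, and the number of repetitions are all assumptions chosen for illustration.</p></div></div>

```python
import numpy as np

# The 15 observed drops: 11 Patriots balls first, then 4 Colts balls.
drops = np.array([0.85, 1.475, 1.175, 1.65, 1.225, 0.725, 0.425, 1.175,
                  1.35, 1.8, 1.375, 0.475, 0.475, 0.275, 0.65])
n_patriots = 11

def difference_of_means(values, n_first):
    """Mean of the first n_first values minus the mean of the rest."""
    return np.mean(values[:n_first]) - np.mean(values[n_first:])

observed = difference_of_means(drops, n_patriots)

rng = np.random.default_rng(0)   # assumed seed, for repeatability only
repetitions = 5000               # assumed number of random permutations

# Shuffle the 15 drops, treat the first 11 as "Patriots" and the last 4
# as "Colts", and record the difference in means each time.
simulated = np.array([
    difference_of_means(rng.permutation(drops), n_patriots)
    for _ in range(repetitions)
])

# Proportion of permutations with a difference at least as large as the
# observed one; the small tolerance guards against floating-point ties.
empirical_p = np.count_nonzero(simulated >= observed - 1e-12) / repetitions

print(observed, empirical_p)
```

<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>A small value of <code>empirical_p</code> means that random permutations rarely produce a difference as large as the one the officials' measurements showed.</p></div></div>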