<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
<channel>
<title>free range statistics - R</title>
<description>Posts categorised as 'R'</description>
<link>https://freerangestats.info</link>
<atom:link href="https://freerangestats.info/feed.R.xml" rel="self" type="application/rss+xml" />
<item>
<title>Prime numbers as sums of three squares. by @ellis2013nz</title>
<description><p>I was interested by a <a href="https://www.linkedin.com/posts/fermatslibrary_397-is-conjectured-to-be-the-largest-prime-activity-7242947116719915008-BRz7?utm_source=share&amp;utm_medium=member_desktop">LinkedIn post about the number 397</a>:</p>
<blockquote>
<p>“397 is conjectured to be the largest prime that can be represented uniquely as the sum of three positive squares”</p>
</blockquote>
<p>That is, 3^2 + 8^2 + 18^2 = 397</p>
<p>This led to some confusion in the comments as people found other prime numbers that can be created as the sum of three squares. But the wording is sloppy; better wording would be:</p>
<blockquote>
<p>“397 is conjectured to be the largest prime that can be represented as the sum of three positive squares of integers in exactly one way”</p>
</blockquote>
<p>Let’s confirm that. The only method I know for something like this is brute force. I can make a data frame with three columns, each with all the squares of positive integers up to some maximum, so the data frame has every combination of those, discarding duplicate combinations by requiring the values to be non-decreasing from the first column to the third. Then we sum the three squares and store the results:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="n">primes</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">tidyverse</span><span class="p">)</span><span class="w">
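</span><span class="n">library</span><span class="p">(</span><span class="n">glue</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">scales</span><span class="p">)</span><span class="w"> </span><span class="c1"># provides comma(), used in the plot further down</span><span class="w">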
</span><span class="n">k</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="m">30</span><span class="w">
</span><span class="n">squares</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="p">(</span><span class="m">1</span><span class="o">:</span><span class="n">k</span><span class="p">)</span><span class="w"> </span><span class="o">^</span><span class="w"> </span><span class="m">2</span><span class="w">
</span><span class="n">primes</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">tibble</span><span class="p">(</span><span class="n">p</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">generate_primes</span><span class="p">(</span><span class="n">max</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">k</span><span class="o">^</span><span class="m">2</span><span class="p">))</span><span class="w">
</span><span class="c1"># This is the part that gets slower with larger k as you make k^3 combinations</span><span class="w">
</span><span class="n">s3s</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">expand_grid</span><span class="p">(</span><span class="n">s1</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">squares</span><span class="p">,</span><span class="w"> </span><span class="n">s2</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">squares</span><span class="p">,</span><span class="w"> </span><span class="n">s3</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">squares</span><span class="p">)</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
</span><span class="n">filter</span><span class="p">(</span><span class="n">s2</span><span class="w"> </span><span class="o">&gt;=</span><span class="w"> </span><span class="n">s1</span><span class="w"> </span><span class="o">&amp;</span><span class="w"> </span><span class="n">s3</span><span class="w"> </span><span class="o">&gt;=</span><span class="w"> </span><span class="n">s2</span><span class="p">)</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
</span><span class="n">mutate</span><span class="p">(</span><span class="n">sum_3_sq</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">s1</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">s2</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">s3</span><span class="p">)</span><span class="w">
</span><span class="c1"># example:</span><span class="w">
</span><span class="n">filter</span><span class="p">(</span><span class="n">s3s</span><span class="p">,</span><span class="w"> </span><span class="n">sum_3_sq</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="m">397</span><span class="p">)</span></code></pre></figure>
<p>This gets us our one combination that adds up to 397:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>     s1    s2    s3 sum_3_sq
  &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;    &lt;dbl&gt;
1     9    64   324      397
</code></pre></div></div>
<p>Next step is to count the number of times each result appears.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">s3s_sum</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">s3s</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
</span><span class="n">group_by</span><span class="p">(</span><span class="n">sum_3_sq</span><span class="p">)</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
</span><span class="n">summarise</span><span class="p">(</span><span class="n">number_3_square_sums</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">n</span><span class="p">())</span><span class="w">
</span><span class="c1"># example:</span><span class="w">
</span><span class="n">s3s_sum</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
</span><span class="n">filter</span><span class="p">(</span><span class="n">number_3_square_sums</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="m">3</span><span class="p">)</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
</span><span class="n">slice</span><span class="p">(</span><span class="m">1</span><span class="p">)</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
</span><span class="n">left_join</span><span class="p">(</span><span class="n">s3s</span><span class="p">,</span><span class="w"> </span><span class="n">by</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"sum_3_sq"</span><span class="p">)</span></code></pre></figure>
<p>So for example we see that 54 (which of course is not a prime - we haven’t yet filtered to primes) can be made 3 ways: as the sum of the squares of 1, 2, 7; of 2, 5, 5; and 3, 3, 6:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  sum_3_sq number_3_square_sums    s1    s2    s3
     &lt;dbl&gt;                &lt;int&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
1       54                    3     1     4    49
2       54                    3     4    25    25
3       54                    3     9     9    36
</code></pre></div></div>
<p>Then it’s a simple matter of joining that summary (of counts of the number of ways to get a given total of three squares) to a data frame of the prime numbers, and drawing a plot of the results:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">res</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">primes</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
</span><span class="n">left_join</span><span class="p">(</span><span class="n">s3s_sum</span><span class="p">,</span><span class="w"> </span><span class="n">by</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"p"</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"sum_3_sq"</span><span class="p">))</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
</span><span class="n">mutate</span><span class="p">(</span><span class="n">number_3_square_sums</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">replace_na</span><span class="p">(</span><span class="n">number_3_square_sums</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">))</span><span class="w">
</span><span class="n">p</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">ggplot</span><span class="p">(</span><span class="n">res</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="w"> </span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">p</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">number_3_square_sums</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_point</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">annotate</span><span class="p">(</span><span class="s2">"point"</span><span class="p">,</span><span class="w"> </span><span class="n">colour</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"red"</span><span class="p">,</span><span class="w"> </span><span class="n">shape</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">4</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">397</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">annotate</span><span class="p">(</span><span class="s2">"text"</span><span class="p">,</span><span class="w"> </span><span class="n">colour</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"red"</span><span class="p">,</span><span class="w"> </span><span class="n">label</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"397"</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">500</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">hjust</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">scale_x_continuous</span><span class="p">(</span><span class="n">label</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">comma</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">labs</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Prime number"</span><span class="p">,</span><span class="w">
</span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Number of ways"</span><span class="p">,</span><span class="w">
</span><span class="n">title</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Number of ways to make a prime number as sum of three positive squares of integers"</span><span class="p">,</span><span class="w">
</span><span class="n">subtitle</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">glue</span><span class="p">(</span><span class="s2">"397 (circled) is the largest with exactly one way, of primes up to {comma(k^2)}."</span><span class="p">))</span><span class="w">
</span><span class="k">if</span><span class="p">(</span><span class="nf">max</span><span class="p">(</span><span class="n">res</span><span class="o">$</span><span class="n">number_3_square_sums</span><span class="p">)</span><span class="w"> </span><span class="o">&lt;</span><span class="w"> </span><span class="m">11</span><span class="p">){</span><span class="w">
</span><span class="n">p</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">p</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">scale_y_continuous</span><span class="p">(</span><span class="n">breaks</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="o">:</span><span class="m">10</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">theme</span><span class="p">(</span><span class="n">panel.grid.minor.y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_blank</span><span class="p">())</span><span class="w">
</span><span class="p">}</span></code></pre></figure>
<p>There’s a bit of fiddling there to make the charts look nice for low values of k (where k is the largest integer I square). In my real-life code the above is surrounded by a loop over different values of k, with the results saved as SVG and PNG images for use in this blog. All of which gets me this result:</p>
<object type="image/svg+xml" data="https://freerangestats.info/img/0280-primes-squares-k30.svg" width="100%"><img src="https://freerangestats.info/img/0280-primes-squares-k30.png" width="100%" /></object>
<p>And here are the results for some larger values of k:</p>
<object type="image/svg+xml" data="https://freerangestats.info/img/0280-primes-squares-k100.svg" width="100%"><img src="https://freerangestats.info/img/0280-primes-squares-k100.png" width="100%" /></object>
<p>As k gets bigger of course the program gets slower to run. This method doesn’t scale well; for primes above about 250,000 the step of making a data frame of all the combinations of three squares starts taking too long. If I wanted to extend this further I’d have to find some ways to do this more efficiently, or put that step into a database that can handle bigger-than-memory data objects.</p>
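<p>A minimal sketch of one such more efficient approach, offered as an illustration rather than the code used for the charts above: instead of materialising every combination in a data frame, loop over the two smaller integers and count totals directly into an integer vector. The function name <code>count_three_squares</code> and its argument <code>n_max</code> are hypothetical, and the sketch only counts representations; it would still need joining to the primes as before.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"># Count the ways to write each n &lt;= n_max as a^2 + b^2 + c^2
# with 1 &lt;= a &lt;= b &lt;= c. Hypothetical helper, not from the original code.
count_three_squares &lt;- function(n_max) {
  counts &lt;- integer(n_max)
  a &lt;- 1
  while (3 * a^2 &lt;= n_max) {
    b &lt;- a
    while (a^2 + 2 * b^2 &lt;= n_max) {
      # largest c that keeps the total within n_max (always &gt;= b here)
      c_max &lt;- floor(sqrt(n_max - a^2 - b^2))
      # for fixed a and b every c gives a distinct total,
      # so a vectorised increment is safe
      totals &lt;- a^2 + b^2 + (b:c_max)^2
      counts[totals] &lt;- counts[totals] + 1L
      b &lt;- b + 1
    }
    a &lt;- a + 1
  }
  counts
}

counts &lt;- count_three_squares(1000)
counts[397] # 1, matching the single combination found earlier
counts[54]  # 3, matching the three combinations found earlier</code></pre></figure>
<p>Memory use here grows with the largest total of interest rather than with the cube of k, which is the main design choice; the loops themselves are plain R and could be slow for very large limits.</p>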
<p>I’m sure there’s some interesting maths behind why this is just a “conjecture” and no-one has been able to prove it!</p>
<object type="image/svg+xml" data="https://freerangestats.info/img/0280-primes-squares-k300.svg" width="100%"><img src="https://freerangestats.info/img/0280-primes-squares-k300.png" width="100%" /></object>
<p>That’s all for today.</p>
</description>
<pubDate>Sat, 21 Sep 2024 00:00:00 +1100</pubDate>
<link>https://freerangestats.info/blog/2024/09/21/primes-squares</link>
<guid isPermaLink="true">https://freerangestats.info/blog/2024/09/21/primes-squares</guid>
</item>
<item>
<title>Stepwise selection of variables in regression is Evil. by @ellis2013nz</title>
<description><p>I’ve recently noticed that stepwise regression is still fairly popular, despite being well and truly frowned upon by well-informed statisticians. By stepwise regression, I mean any modelling strategy that involves adding or subtracting variables from a regression model on the basis that they are “significant”, reduce the Akaike Information Criterion, increase adjusted R-squared, or improve any other data-driven statistic.</p>
<p>This might be an automated variable selection procedure, or it might be a matter of eyeballing the results of the first model you fit and saying (for example) “Let’s take literacy out, its p-value is not significant, and it will be a more parsimonious model once we do that.”</p>
<p>And then people produce and report t tests, F tests, and so on, as though the end model was the one they always intended to run.</p>
<p>Let me be clear about this. This is wrong. It’s not as disastrously wrong as, say, sorting the data separately one column at a time before you fit your model, but it’s still objectively bad. As my professor once told our class:</p>
<blockquote>
<p>“If you choose the variables in your model based on the data and then run tests on them, you are Evil; and you will go to Hell.”</p>
</blockquote>
<p>Why is it wrong? Here are the seven reasons given by Frank Harrell in his must-read classic, <em>Regression Modeling Strategies</em>:</p>
<ol>
<li>The R-squared or even adjusted R-squared values of the end model are biased high.</li>
<li>The F and Chi-square test statistics of the final model do not have the claimed distribution.</li>
<li>The standard errors of coefficient estimates are biased low and confidence intervals for effects and predictions are falsely narrow.</li>
<li>The p values are too small (there are severe multiple comparison problems in addition to problems 2. and 3.) and do not have the proper meaning, and it is difficult to correct for this.</li>
<li>The regression coefficients are biased high in absolute value and need shrinkage but this is rarely done.</li>
<li>Variable selection is made arbitrary by collinearity.</li>
<li>It allows us to not think about the problem.</li>
</ol>
<p>At the core of the problem is using statistical inference methods like p values, confidence intervals and ANOVA F tests that were designed for, and are valid for, a pre-specified model, but applying them instead to a model we have structured based on the data. The variables are selected partly based on chance, and we are giving ourselves a sneaky head start in making a variable look significant.</p>
<p>Basically, this is the sort of thing that leads to the reproducibility crisis in science.</p>
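<p>That head start can be seen in a small sketch, which is an illustration under assumed settings (100 observations, 20 pure-noise candidate variables, 500 repetitions; all names below are hypothetical) and not one of the simulations in this post. Each repetition picks the candidate most correlated with y and then tests it as though it had been chosen in advance.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">set.seed(123)
n_obs  &lt;- 100  # observations per simulated dataset (assumed)
n_vars &lt;- 20   # candidate explanatory variables, all pure noise (assumed)
n_reps &lt;- 500  # number of simulated datasets (assumed)

picked_p &lt;- replicate(n_reps, {
  X &lt;- matrix(rnorm(n_obs * n_vars), nrow = n_obs)
  y &lt;- rnorm(n_obs)
  # choose the variable based on the data...
  best &lt;- which.max(abs(cor(X, y)))
  # ...then test it as though it had been pre-specified
  summary(lm(y ~ X[, best]))$coefficients[2, 4]
})

# proportion of datasets where the hand-picked noise variable looks 'significant'
mean(picked_p &lt; 0.05)</code></pre></figure>
<p>If the 20 tests were independent, the chance of at least one falling below 0.05 would be 1 - 0.95^20, or roughly 64%, so the proportion from this sketch should land far above the nominal 5%.</p>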
<p>Some of the problems don’t matter as much if your goal for the model is just prediction, not interpretation of the model and its coefficients. But most of the time that I see the method used (including recent examples being distributed by so-called experts as part of their online teaching), the end model is indeed used for interpretation, and I have no doubt this is also the case with much published science. Further, even when the goal is only prediction, there are better methods, like the Lasso, for dealing with a large number of variables.</p>
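<p>For readers who have not used the Lasso, here is a minimal sketch of what that alternative might look like. It assumes the <code>glmnet</code> package is installed, uses made-up data in which only the first two of 50 variables matter, and is not part of the analysis in this post.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">library(glmnet) # assumption: glmnet is installed; it is not used elsewhere in this post

set.seed(123)
n_obs  &lt;- 200
n_vars &lt;- 50
X &lt;- matrix(rnorm(n_obs * n_vars), nrow = n_obs)
# made-up truth for illustration: only the first two columns affect y
y &lt;- 2 * X[, 1] - 1.5 * X[, 2] + rnorm(n_obs)

# alpha = 1 is the Lasso penalty; cross-validation chooses the penalty strength
cv_fit &lt;- cv.glmnet(X, y, alpha = 1)

# coefficients at the more conservative 'one standard error' penalty
lasso_coefs &lt;- as.matrix(coef(cv_fit, s = "lambda.1se"))[, 1]
names(lasso_coefs)[lasso_coefs != 0]</code></pre></figure>
<p>The shrinkage and the selection happen in one penalised fit, which is the point of the method; the usual p values and confidence intervals are still not valid for the variables it keeps.</p>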
<p>Let’s look at a couple of simulations to show how this is a problem.</p>
<h2 id="increases-the-false-positive-rate-even-with-white-noise">Increases the false positive rate even with white noise</h2>
<p>First, let’s take a case where we simulate data that is known to have no relation at all to the response variable. In the code below I simulate 1,000 observations with 100 explanatory X variables and 1 response variable y. All of these variables are unrelated to each other and are just normally distributed with a mean of zero and standard deviation of 1.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="n">MASS</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">tidyverse</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">glue</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">foreach</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">doParallel</span><span class="p">)</span><span class="w">
</span><span class="c1">#--------------------X not related to y--------------------</span><span class="w">
</span><span class="n">set.seed</span><span class="p">(</span><span class="m">42</span><span class="p">)</span><span class="w">
</span><span class="n">k</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="m">100</span><span class="w">
</span><span class="n">n</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="m">1000</span><span class="w">
</span><span class="n">Sigma</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">matrix</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">nrow</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">k</span><span class="p">,</span><span class="w"> </span><span class="n">ncol</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">k</span><span class="p">)</span><span class="w">
</span><span class="n">diag</span><span class="p">(</span><span class="n">Sigma</span><span class="p">)</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="m">1</span><span class="w">
</span><span class="n">noise</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">mvrnorm</span><span class="p">(</span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">mu</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">k</span><span class="p">),</span><span class="w"> </span><span class="n">Sigma</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Sigma</span><span class="p">)</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
</span><span class="n">as.data.frame</span><span class="p">()</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
</span><span class="n">as_tibble</span><span class="p">()</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
</span><span class="n">mutate</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">rnorm</span><span class="p">(</span><span class="n">n</span><span class="p">))</span><span class="w">
</span><span class="n">full_model</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">lm</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">.</span><span class="p">,</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">noise</span><span class="p">)</span><span class="w">
</span><span class="c1"># we get 6 variables that look 'significant' - about what</span><span class="w">
</span><span class="c1"># we'd expect, about 5% false positives:</span><span class="w">
</span><span class="n">summary</span><span class="p">(</span><span class="n">full_model</span><span class="p">)</span><span class="o">$</span><span class="n">coefficients</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
</span><span class="n">as.data.frame</span><span class="p">()</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
</span><span class="n">filter</span><span class="p">(</span><span class="n">`Pr(&gt;|t|)`</span><span class="w"> </span><span class="o">&lt;</span><span class="w"> </span><span class="m">0.05</span><span class="p">)</span></code></pre></figure>
<p>When I fit a regression model of y ~ X, I should expect about five of the columns to appear ‘significant’ at the conventional threshold of a p value of 0.05 or less, because that’s more or less the definition of that critical cut-off value. That is, we tolerate a 1 in 20 false positive rate. In this case we have six variables below the cut-off, about what we’d expect:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>       Estimate Std. Error   t value   Pr(&gt;|t|)
V2  -0.06606672 0.03213918 -2.055644 0.04010514
V38  0.06881778 0.03406473  2.020206 0.04365814
V62  0.06263414 0.03137298  1.996436 0.04618768
V91  0.06826250 0.03302463  2.067018 0.03901824
V94 -0.07923079 0.03423568 -2.314275 0.02087724
V96 -0.07962012 0.03290373 -2.419790 0.01572689
</code></pre></div></div>
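<p>The “about what we’d expect” claim can be checked with a little arithmetic. This sketch treats the 100 tests as independent with a 5% false positive chance each, which is an approximation for coefficients from a single regression; the variable names are made up for illustration.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">n_tests &lt;- 100  # number of explanatory variables tested
alpha   &lt;- 0.05 # conventional cut-off

# expected number of false positives under the approximation of independent tests
expected_fp &lt;- n_tests * alpha # 5

# binomial standard deviation of that count, about 2.18
sd_fp &lt;- sqrt(n_tests * alpha * (1 - alpha))

# probability of seeing six or more 'significant' noise variables
p_six_or_more &lt;- pbinom(5, size = n_tests, prob = alpha, lower.tail = FALSE)</code></pre></figure>
<p>Six is less than half a standard deviation above the expected five, so it is an unremarkable result for pure noise.</p>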
<p>Now let’s use stepwise selection, “both” directions (so we can remove variables from the model or add them), using the Akaike Information Criterion to choose a ‘better’ model at each step. This is better than just using p values, and much better than using p values and a low cut-off like 0.05, so I’m giving the stepwise method a fair go here.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">stepped</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">step</span><span class="p">(</span><span class="n">full_model</span><span class="p">)</span><span class="w">
</span><span class="n">summary</span><span class="p">(</span><span class="n">stepped</span><span class="p">)</span><span class="w"> </span><span class="o">$</span><span class="n">coefficients</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
</span><span class="n">as.data.frame</span><span class="p">()</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
</span><span class="n">filter</span><span class="p">(</span><span class="n">`Pr(&gt;|t|)`</span><span class="w"> </span><span class="o">&lt;</span><span class="w"> </span><span class="m">0.05</span><span class="p">)</span></code></pre></figure>
<p>That gets us these variables showing up as ‘significant’:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>       Estimate Std. Error   t value    Pr(&gt;|t|)
V2  -0.06312445 0.03046917 -2.071748 0.038550346
V4  -0.07353379 0.03222313 -2.282019 0.022702106
V32 -0.06120508 0.03094750 -1.977707 0.048241917
V38  0.06383031 0.03227612  1.977633 0.048250238
V90  0.06288076 0.03094938  2.031729 0.042450732
V91  0.07450724 0.03105172  2.399456 0.016605492
V94 -0.06617689 0.03208892 -2.062297 0.039442821
V96 -0.08052606 0.03073565 -2.619957 0.008930218
</code></pre></div></div>
<p>So the net impact of this fancy-looking automated procedure is to worsen our false positive rate from 6% to 8%.</p>
<p>OK, that’s just one dataset. Let’s try it with a range of others, of different sample sizes, and to make things more interesting let’s let the X variables sometimes be correlated with each other. The stepwise selection process can be a bit slow so I spread the 700 runs of the simulation below over seven parallel processes:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="c1"># set up parallel processing cluster</span><span class="w">
</span><span class="n">cluster</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">makeCluster</span><span class="p">(</span><span class="m">7</span><span class="p">)</span><span class="w"> </span><span class="c1"># only any good if you have at least 7 processors :)</span><span class="w">
</span><span class="n">registerDoParallel</span><span class="p">(</span><span class="n">cluster</span><span class="p">)</span><span class="w">
</span><span class="n">clusterEvalQ</span><span class="p">(</span><span class="n">cluster</span><span class="p">,</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">foreach</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">tidyverse</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">MASS</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">glue</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">scales</span><span class="p">)</span><span class="w">
</span><span class="p">})</span><span class="w">
</span><span class="n">noise_results</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">foreach</span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="o">:</span><span class="m">700</span><span class="p">,</span><span class="w"> </span><span class="n">.combine</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">rbind</span><span class="p">)</span><span class="w"> </span><span class="o">%dopar%</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">set.seed</span><span class="p">(</span><span class="n">i</span><span class="p">)</span><span class="w">
</span><span class="n">k</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="m">100</span><span class="w">
</span><span class="n">n</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">sample</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="m">200</span><span class="p">,</span><span class="w"> </span><span class="m">400</span><span class="p">,</span><span class="w"> </span><span class="m">800</span><span class="p">,</span><span class="w"> </span><span class="m">1600</span><span class="p">),</span><span class="w"> </span><span class="n">size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">r</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">runif</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">0.7</span><span class="p">)</span><span class="w">
</span><span class="n">Sigma</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">matrix</span><span class="p">(</span><span class="n">r</span><span class="p">,</span><span class="w"> </span><span class="n">nrow</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">k</span><span class="p">,</span><span class="w"> </span><span class="n">ncol</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">k</span><span class="p">)</span><span class="w">
</span><span class="n">diag</span><span class="p">(</span><span class="n">Sigma</span><span class="p">)</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="m">1</span><span class="w">
</span><span class="n">noise</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">mvrnorm</span><span class="p">(</span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">mu</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">k</span><span class="p">),</span><span class="w"> </span><span class="n">Sigma</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Sigma</span><span class="p">)</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
</span><span class="n">as.data.frame</span><span class="p">()</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
</span><span class="n">as_tibble</span><span class="p">()</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
</span><span class="n">mutate</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">rnorm</span><span class="p">(</span><span class="n">n</span><span class="p">))</span><span class="w">
</span><span class="n">full</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">lm</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">.</span><span class="p">,</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">noise</span><span class="p">)</span><span class="w">
</span><span class="n">stepped</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">stepAIC</span><span class="p">(</span><span class="n">full</span><span class="p">,</span><span class="w"> </span><span class="n">trace</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span><span class="w">
</span><span class="c1"># count the false positives</span><span class="w">
</span><span class="n">false_pos1</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">summary</span><span class="p">(</span><span class="n">full</span><span class="p">)</span><span class="o">$</span><span class="n">coefficients</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
</span><span class="n">as.data.frame</span><span class="p">()</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
</span><span class="n">filter</span><span class="p">(</span><span class="n">`Pr(&gt;|t|)`</span><span class="w"> </span><span class="o">&lt;</span><span class="w"> </span><span class="m">0.05</span><span class="p">)</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
</span><span class="n">nrow</span><span class="p">()</span><span class="w">
</span><span class="n">false_pos2</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">summary</span><span class="p">(</span><span class="n">stepped</span><span class="p">)</span><span class="o">$</span><span class="n">coefficients</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
</span><span class="n">as.data.frame</span><span class="p">()</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
</span><span class="n">filter</span><span class="p">(</span><span class="n">`Pr(&gt;|t|)`</span><span class="w"> </span><span class="o">&lt;</span><span class="w"> </span><span class="m">0.05</span><span class="p">)</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
</span><span class="n">nrow</span><span class="p">()</span><span class="w">
</span><span class="n">tibble</span><span class="p">(</span><span class="n">seed</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">i</span><span class="p">,</span><span class="w"> </span><span class="n">full</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">false_pos1</span><span class="p">,</span><span class="w"> </span><span class="n">stepped</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">false_pos2</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">r</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">r</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">summary</span><span class="p">(</span><span class="n">noise_results</span><span class="p">)</span><span class="w">
</span><span class="n">noise_results</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
</span><span class="n">rename</span><span class="p">(</span><span class="n">`All variables included`</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">full</span><span class="p">,</span><span class="w">
</span><span class="n">`Stepwise selection of variables`</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">stepped</span><span class="p">)</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
</span><span class="n">mutate</span><span class="p">(</span><span class="n">n2</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">glue</span><span class="p">(</span><span class="s2">"Sample size: {n}"</span><span class="p">),</span><span class="w">
</span><span class="n">n2</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fct_reorder</span><span class="p">(</span><span class="n">n2</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="p">))</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
</span><span class="n">gather</span><span class="p">(</span><span class="n">method</span><span class="p">,</span><span class="w"> </span><span class="n">value</span><span class="p">,</span><span class="w"> </span><span class="o">-</span><span class="n">seed</span><span class="p">,</span><span class="w"> </span><span class="o">-</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="o">-</span><span class="n">n2</span><span class="p">,</span><span class="w"> </span><span class="o">-</span><span class="n">r</span><span class="p">)</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
</span><span class="n">ggplot</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">r</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">value</span><span class="p">,</span><span class="w"> </span><span class="n">colour</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">method</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">facet_wrap</span><span class="p">(</span><span class="o">~</span><span class="n">n2</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_hline</span><span class="p">(</span><span class="n">yintercept</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">5</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_smooth</span><span class="p">(</span><span class="n">se</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">,</span><span class="w"> </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"gam"</span><span class="p">,</span><span class="w"> </span><span class="n">formula</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">x</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_point</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">labs</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Number of false positives -\nvariables returned as 'significant'"</span><span class="p">,</span><span class="w">
</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Correlation of the X predictor variables"</span><span class="p">,</span><span class="w">
</span><span class="n">colour</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">""</span><span class="p">,</span><span class="w">
</span><span class="n">title</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"False positive rates when using stepwise variable selection"</span><span class="p">,</span><span class="w">
</span><span class="n">subtitle</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Models with 100 X explanatory variables that are in truth unrelated to Y; expecting 5 falsely 'significant' variables.
Small sample sizes make the false positive problem worse for stepwise selection of variables; multicollinearity in the X when no relation to the Y doesn't matter."</span><span class="p">)</span></code></pre></figure>
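<p>As an aside on the simulation code above: the two lines that build Sigma give every pair of X variables the same correlation r, with variances of 1 on the diagonal. Here is a minimal sketch to check that, using a small k and a fixed r that are assumed purely for illustration (in the simulation, k is 100 and r is drawn at random for each run):</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">library(MASS)  # for mvrnorm()

set.seed(1)
k &lt;- 5     # illustrative number of X variables (an assumption, not the post's 100)
r &lt;- 0.4   # illustrative common correlation (an assumption)

# same construction as in the simulation: r everywhere, 1 on the diagonal
Sigma &lt;- matrix(r, nrow = k, ncol = k)
diag(Sigma) &lt;- 1

X &lt;- mvrnorm(n = 5000, mu = rep(0, k), Sigma = Sigma)

# the off-diagonal sample correlations should all sit close to r
emp_cor &lt;- cor(X)
round(emp_cor, 2)</code></pre></figure>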
<p>The average false positive rate of the full model is 5.1%; for the stepwise variable selection it is 9.5%. In the chart below we can see that sample size relative to the number of variables in X matters a lot here:</p>
<object type="image/svg+xml" data="https://freerangestats.info/img/0279-noisy.svg" width="100%"><img src="https://freerangestats.info/img/0279-noisy.png" width="100%" /></object>
<p>For example, the case with the sample size of only 200 observations gets a false positive rate above 15% from the stepwise method. But even with larger samples, we take a hit in false positives from the stepwise approach. The degree of multicollinearity in the X doesn’t seem to make much difference.</p>
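<p>The benchmark of 5 false positives is just arithmetic: 100 noise variables, each tested at the 5% level. A rough sketch of that benchmark for the full model follows. It uses independent predictors and base R only, count_false_pos() is a hypothetical helper rather than part of the simulation above, and unlike the simulation it leaves the intercept out of the count:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"># hypothetical helper: fit the full model to pure noise and count the
# predictors whose p-value falls below alpha
count_false_pos &lt;- function(n = 400, k = 100, alpha = 0.05) {
  d &lt;- as.data.frame(matrix(rnorm(n * k), nrow = n, ncol = k))
  d$y &lt;- rnorm(n)                      # y is unrelated to every column of X
  fit &lt;- lm(y ~ ., data = d)
  p &lt;- summary(fit)$coefficients[-1, "Pr(&gt;|t|)"]  # drop the intercept row
  sum(p &lt; alpha)
}

set.seed(123)
counts &lt;- replicate(20, count_false_pos())
mean(counts)   # should land in the neighbourhood of k * alpha = 5</code></pre></figure>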
<p>It might seem unfair to have a model with 100 explanatory variables and only 200 observations, but out there on the internet (I’m not going to link) there are guides telling you it is ok to do this procedure even when you have more variables than observations. In fact I have a horrible fear that this practice might be common in some parts of science. You can imagine how doing <em>that</em> is basically a machine for generating false, non-reproducible findings.</p>
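<p>One concrete symptom of the more-variables-than-observations case, sketched here with toy sizes assumed for illustration (50 observations and 100 noise variables): ordinary least squares cannot estimate all the coefficients, and lm() reports the ones it cannot identify as NA.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">set.seed(42)
n &lt;- 50    # observations (illustrative assumption)
k &lt;- 100   # candidate variables, more than n (illustrative assumption)

d &lt;- as.data.frame(matrix(rnorm(n * k), nrow = n, ncol = k))
d$y &lt;- rnorm(n)

fit &lt;- lm(y ~ ., data = d)

sum(is.na(coef(fit)))   # number of coefficients that could not be estimated
df.residual(fit)        # residual degrees of freedom left for inference</code></pre></figure>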
<h2 id="even-the-correctly-retained-variables-coefficients-are-biased-big">Even the correctly-retained variables’ coefficients are biased big</h2>
<p>The above simulation was pure noise so everything was a false positive. What does stepwise variable selection do in a more realistic case where some of the variables are correctly in the model and are related to y?</p>
<p>To explore this I wrote a function (code a little way further down the blog) to simulate data with 15 correlated X variables and 1 y variable. The true model is:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>y = V1 + V2 + V3 + V4 + V5 + V6 + V7 + 0.1 (V8 + V9 + V10) + e
</code></pre></div></div>
<p>That is, the true regression coefficients for variables V1 to V7 are 1; for V8, V9 and V10 they are 0.1; for the remaining 5 variables there is no structural relationship to y.</p>
<p>When we simulate 50 data sets of this sort and use stepwise variable selection to regress y on X, here are the coefficients we get. Each point represents the coefficient for one variable from one of those runs.</p>
<object type="image/svg+xml" data="https://freerangestats.info/img/0279-main-sim.svg" width="100%"><img src="https://freerangestats.info/img/0279-main-sim.png" width="100%" /></object>
<p>The large dots on zero indicate the multiple runs in which that particular variable was not included in the final model. We see:</p>
<ul>
<li>On many occasions variables V11 to V15 were rightly excluded, but on a smattering of occasions they do get included in the model.</li>
<li>A lot of false negatives - variables V1 to V10 that should be found in the model and aren’t.</li>
<li>Worse, when one of variables V1 to V10 is correctly included in the final model, the coefficient estimated for it is <em>always</em> (in this dataset) larger than the true coefficient (which remember should be 1 or 0.1 - the correct values shown by the red crosses).</li>
</ul>
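<p>The last point is what you would expect whenever an estimate is only reported if it clears a threshold. Here is a deliberately simplified sketch of the mechanism with one predictor, a true coefficient of 1, a noise standard deviation of 9 and 200 observations. Keeping an estimate only when p &lt; 0.05 is a stand-in assumed for illustration; it is not the AIC rule that stepAIC() actually applies.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">set.seed(2024)
true_beta &lt;- 1

# each column of sims holds the estimate and p-value from one simulated dataset
sims &lt;- replicate(2000, {
  x &lt;- rnorm(200)
  y &lt;- true_beta * x + rnorm(200, 0, 9)
  s &lt;- summary(lm(y ~ x))$coefficients
  c(estimate = s["x", "Estimate"], p = s["x", "Pr(&gt;|t|)"])
})

est  &lt;- sims["estimate", ]
kept &lt;- est[sims["p", ] &lt; 0.05]   # crude stand-in for 'survived selection'

mean(est)    # every run: close to the true value of 1
mean(kept)   # only the runs that passed the filter: pushed well above 1</code></pre></figure>
<p>Averaged over every run the estimate is about right; averaged over only the runs that passed the filter it is too big, because with this much noise a coefficient has to be overestimated to look convincing.</p>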
<p>For comparison, here is an equivalent chart for when we fit the full model to these data. There’s a lot of variation in the coefficient estimates, but at least they’re not biased (that is, on average they are correct, their expected value is the true value):</p>
<object type="image/svg+xml" data="https://freerangestats.info/img/0279-main-sim-full.svg" width="100%"><img src="https://freerangestats.info/img/0279-main-sim-full.png" width="100%" /></object>
<p>Here’s the code for the function simulating that data and drawing the plots:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="c1">#---------------------when X is related to y--------------</span><span class="w">
</span><span class="cd">#' @param xcm correlations of the X variables with each other, as a multiplier of</span><span class="w">
</span><span class="cd">#' their standard deviation (all X variables have the same variance / sd of 1)</span><span class="w">
</span><span class="cd">#' @param ysdm standard deviation of the random noise added to y, expressed as a multiplier of the variance</span><span class="w">
</span><span class="cd">#' of the X</span><span class="w">
</span><span class="cd">#' @param n sample size</span><span class="w">
</span><span class="cd">#' @param k number of columns in X. Only currently works if this is 15 (because of the hard-coded true_coef)</span><span class="w">
</span><span class="cd">#' @param runs number of simulations to run</span><span class="w">
</span><span class="cd">#' @param seed random seed for reproducibility</span><span class="w">
</span><span class="n">sim_steps</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">xcm</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.4</span><span class="p">,</span><span class="w"> </span><span class="n">ysdm</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">9</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">200</span><span class="p">,</span><span class="w"> </span><span class="n">k</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">15</span><span class="p">,</span><span class="w"> </span><span class="n">runs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">50</span><span class="p">,</span><span class="w"> </span><span class="n">seed</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">321</span><span class="p">){</span><span class="w">
</span><span class="n">set.seed</span><span class="p">(</span><span class="n">seed</span><span class="p">)</span><span class="w">
</span><span class="n">results</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">tibble</span><span class="p">()</span><span class="w">
</span><span class="n">true_coef</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="nf">rep</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">7</span><span class="p">),</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="m">0.2</span><span class="p">,</span><span class="w"> </span><span class="m">3</span><span class="p">),</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">5</span><span class="p">))</span><span class="w">
</span><span class="k">for</span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="m">1</span><span class="o">:</span><span class="n">runs</span><span class="p">){</span><span class="w">
</span><span class="n">Sigma</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">matrix</span><span class="p">(</span><span class="n">xcm</span><span class="p">,</span><span class="w"> </span><span class="n">nrow</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">k</span><span class="p">,</span><span class="w"> </span><span class="n">ncol</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">k</span><span class="p">)</span><span class="w">
</span><span class="n">diag</span><span class="p">(</span><span class="n">Sigma</span><span class="p">)</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="m">1</span><span class="w">
</span><span class="c1"># Sigma, not sigma squared</span><span class="w">
</span><span class="n">m</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">mvrnorm</span><span class="p">(</span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">mu</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">k</span><span class="p">),</span><span class="w"> </span><span class="n">Sigma</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Sigma</span><span class="p">)</span><span class="w">
</span><span class="n">d</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">m</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
</span><span class="n">as.data.frame</span><span class="p">()</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
</span><span class="n">mutate</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">m</span><span class="w"> </span><span class="o">%*%</span><span class="w"> </span><span class="n">true_coef</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">rnorm</span><span class="p">(</span><span class="n">n</span><span class="p">(),</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">ysdm</span><span class="p">))</span><span class="w">
</span><span class="n">mod</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">lm</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">.</span><span class="p">,</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">d</span><span class="p">)</span><span class="w">
</span><span class="n">step_mod</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">stepAIC</span><span class="p">(</span><span class="n">mod</span><span class="p">,</span><span class="w"> </span><span class="n">trace</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span><span class="w">
</span><span class="n">cm</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">coef</span><span class="p">(</span><span class="n">mod</span><span class="p">)[</span><span class="m">-1</span><span class="p">]</span><span class="w">
</span><span class="n">csm</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">coef</span><span class="p">(</span><span class="n">step_mod</span><span class="p">)[</span><span class="m">-1</span><span class="p">]</span><span class="w">
</span><span class="n">results</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">rbind</span><span class="p">(</span><span class="n">results</span><span class="p">,</span><span class="w">
</span><span class="n">tibble</span><span class="p">(</span><span class="w">
</span><span class="n">variable</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">names</span><span class="p">(</span><span class="n">csm</span><span class="p">),</span><span class="w">
</span><span class="n">coefficient</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">csm</span><span class="p">,</span><span class="w">
</span><span class="n">run</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">i</span><span class="p">,</span><span class="w">
</span><span class="n">model</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Stepwise"</span><span class="w">
</span><span class="p">),</span><span class="w">
</span><span class="n">tibble</span><span class="p">(</span><span class="w">
</span><span class="n">variable</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">names</span><span class="p">(</span><span class="n">cm</span><span class="p">),</span><span class="w">
</span><span class="n">coefficient</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">cm</span><span class="p">,</span><span class="w">
</span><span class="n">run</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">i</span><span class="p">,</span><span class="w">
</span><span class="n">model</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Full model"</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">true_coef_df</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">tibble</span><span class="p">(</span><span class="w">
</span><span class="n">variable</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">factor</span><span class="p">(</span><span class="nf">names</span><span class="p">(</span><span class="n">d</span><span class="p">)[</span><span class="m">1</span><span class="o">:</span><span class="n">k</span><span class="p">],</span><span class="w"> </span><span class="n">levels</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">names</span><span class="p">(</span><span class="n">d</span><span class="p">)[</span><span class="m">1</span><span class="o">:</span><span class="n">k</span><span class="p">]),</span><span class="w">
</span><span class="n">coefficient</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">true_coef</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="n">results</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">results</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
</span><span class="n">complete</span><span class="p">(</span><span class="n">run</span><span class="p">,</span><span class="w"> </span><span class="n">variable</span><span class="p">,</span><span class="w"> </span><span class="n">model</span><span class="p">,</span><span class="w"> </span><span class="n">fill</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="n">coefficient</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">))</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
</span><span class="n">mutate</span><span class="p">(</span><span class="n">variable</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">factor</span><span class="p">(</span><span class="n">variable</span><span class="p">,</span><span class="w"> </span><span class="n">levels</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">true_coef_df</span><span class="o">$</span><span class="n">variable</span><span class="p">))</span><span class="w">
</span><span class="n">biases_df</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">results</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
</span><span class="n">filter</span><span class="p">(</span><span class="n">coefficient</span><span class="w"> </span><span class="o">!=</span><span class="w"> </span><span class="m">0</span><span class="p">)</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
</span><span class="n">left_join</span><span class="p">(</span><span class="n">true_coef_df</span><span class="p">,</span><span class="w"> </span><span class="n">by</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"variable"</span><span class="p">)</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
</span><span class="n">group_by</span><span class="p">(</span><span class="n">model</span><span class="p">)</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
</span><span class="n">summarise</span><span class="p">(</span><span class="n">bias</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">round</span><span class="p">(</span><span class="n">mean</span><span class="p">(</span><span class="n">coefficient.x</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">coefficient.y</span><span class="p">),</span><span class="w"> </span><span class="m">2</span><span class="p">))</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
</span><span class="n">mutate</span><span class="p">(</span><span class="n">r2</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="n">summary</span><span class="p">(</span><span class="n">mod</span><span class="p">)</span><span class="o">$</span><span class="n">r.squared</span><span class="p">,</span><span class="w"> </span><span class="n">summary</span><span class="p">(</span><span class="n">step_mod</span><span class="p">)</span><span class="o">$</span><span class="n">r.squared</span><span class="p">),</span><span class="w">
</span><span class="n">xcm</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">xcm</span><span class="p">,</span><span class="w">
</span><span class="n">ysdm</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ysdm</span><span class="p">,</span><span class="w">
</span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">n</span><span class="p">)</span><span class="w">
</span><span class="n">biases</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">pull</span><span class="p">(</span><span class="n">biases_df</span><span class="p">,</span><span class="w"> </span><span class="n">bias</span><span class="p">)</span><span class="w">
</span><span class="nf">names</span><span class="p">(</span><span class="n">biases</span><span class="p">)</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">pull</span><span class="p">(</span><span class="n">biases_df</span><span class="p">,</span><span class="w"> </span><span class="n">model</span><span class="p">)</span><span class="w">
</span><span class="n">mclabel</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">case_when</span><span class="p">(</span><span class="w">
</span><span class="nf">abs</span><span class="p">(</span><span class="n">xcm</span><span class="p">)</span><span class="w"> </span><span class="o">&lt;</span><span class="w"> </span><span class="m">0.05</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="s2">"negligible"</span><span class="p">,</span><span class="w">
</span><span class="nf">abs</span><span class="p">(</span><span class="n">xcm</span><span class="p">)</span><span class="w"> </span><span class="o">&lt;</span><span class="w"> </span><span class="m">0.19</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="s2">"mild"</span><span class="p">,</span><span class="w">
</span><span class="nf">abs</span><span class="p">(</span><span class="n">xcm</span><span class="p">)</span><span class="w"> </span><span class="o">&lt;</span><span class="w"> </span><span class="m">0.39</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="s2">"medium"</span><span class="p">,</span><span class="w">
</span><span class="kc">TRUE</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="s2">"strong"</span><span class="w">
</span><span class="p">)</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
</span><span class="n">paste</span><span class="p">(</span><span class="s2">"multicollinearity"</span><span class="p">)</span><span class="w">
</span><span class="n">p1</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">results</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
</span><span class="n">filter</span><span class="p">(</span><span class="n">model</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"Stepwise"</span><span class="p">)</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
</span><span class="n">count</span><span class="p">(</span><span class="n">variable</span><span class="p">,</span><span class="w"> </span><span class="n">coefficient</span><span class="p">)</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
</span><span class="n">ggplot</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">coefficient</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">variable</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_point</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">n</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_point</span><span class="p">(</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">true_coef_df</span><span class="p">,</span><span class="w"> </span><span class="n">colour</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"red"</span><span class="p">,</span><span class="w"> </span><span class="n">size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">4</span><span class="p">,</span><span class="w"> </span><span class="n">shape</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">12</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">scale_size_area</span><span class="p">(</span><span class="n">breaks</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="o">:</span><span class="m">4</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="m">10</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">labs</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Coefficient value"</span><span class="p">,</span><span class="w">
</span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Variable"</span><span class="p">,</span><span class="w">
</span><span class="n">size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Number of observations:\n(usually only one, except when coefficient dropped altogether)"</span><span class="p">,</span><span class="w">
</span><span class="n">title</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Stepwise regression returns coefficient estimates biased away from zero"</span><span class="p">,</span><span class="w">
</span><span class="n">subtitle</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">glue</span><span class="p">(</span><span class="s2">"Black dots show coefficient estimates from one run of a stepwise (AIC-based) model fitting. Red squares show correct values.
Coefficients of variables left in the model are on average {biases['Stepwise']} too large (compared to real value of 0, 0.1 or 1).
Also, real explanatory variables are often dropped. Fake ones are often included."</span><span class="p">),</span><span class="w">
</span><span class="n">caption</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">glue</span><span class="p">(</span><span class="s2">"Simulated data with a model that explains about {percent(summary(step_mod)$r.squared)} of variation in response variable, with {mclabel}, by https://freerangestats.info"</span><span class="p">))</span><span class="w">
</span><span class="n">p2</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">results</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
</span><span class="n">filter</span><span class="p">(</span><span class="n">model</span><span class="w"> </span><span class="o">!=</span><span class="w"> </span><span class="s2">"Stepwise"</span><span class="p">)</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
</span><span class="n">count</span><span class="p">(</span><span class="n">variable</span><span class="p">,</span><span class="w"> </span><span class="n">coefficient</span><span class="p">)</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
</span><span class="n">ggplot</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">coefficient</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">variable</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_point</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_point</span><span class="p">(</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">true_coef_df</span><span class="p">,</span><span class="w"> </span><span class="n">colour</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"red"</span><span class="p">,</span><span class="w"> </span><span class="n">size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">4</span><span class="p">,</span><span class="w"> </span><span class="n">shape</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">12</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">labs</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Coefficient value"</span><span class="p">,</span><span class="w">
</span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Variable"</span><span class="p">,</span><span class="w">
</span><span class="n">title</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Using the full model returns unbiased coefficient estimates"</span><span class="p">,</span><span class="w">
</span><span class="n">subtitle</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">glue</span><span class="p">(</span><span class="s2">"Black dots show coefficient estimates from one run of an all-variables-in model fitting. Red squares show correct values.
Coefficients of variables left in the model are on average {biases['Full model']} too large (compared to real value of 0, 0.1 or 1)."</span><span class="p">),</span><span class="w">
</span><span class="n">caption</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">glue</span><span class="p">(</span><span class="s2">"Simulated data with a model that explains about {percent(summary(mod)$r.squared)} of variation in response variable, with {mclabel}, by https://freerangestats.info"</span><span class="p">))</span><span class="w">
</span><span class="nf">return</span><span class="p">(</span><span class="nf">list</span><span class="p">(</span><span class="w">
</span><span class="n">results</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">results</span><span class="p">,</span><span class="w">
</span><span class="n">p1</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">p1</span><span class="p">,</span><span class="w">
</span><span class="n">p2</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">p2</span><span class="p">,</span><span class="w">
</span><span class="n">biases_df</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">biases_df</span><span class="w">
</span><span class="p">))</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">my_sim</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">sim_steps</span><span class="p">()</span><span class="w">
</span><span class="n">my_sim</span><span class="o">$</span><span class="n">p1</span><span class="w"> </span><span class="c1"># plot results for stepwise selection</span><span class="w">
</span><span class="n">my_sim</span><span class="o">$</span><span class="n">p2</span><span class="w"> </span><span class="c1"># plot results for full model</span></code></pre></figure>
<p>Now, you can see that there are a few arbitrary aspects to that simulation: in particular the sample size, the variance of the y relative to the X, and the multicollinearity of the X. The idea of having it as a function is that you can play around with these and see the impacts. For example, if you make the variance of y smaller compared to that of the X, the model gets better at explaining the variance and the stepwise algorithm is less prone to getting things wrong. Rather than include a whole bunch of individual cases, I ran some more simulations covering a range of such values, so we can see how the average bias in the estimated regression coefficients remaining in the model relates to those parameters.</p>
<p>So here is the relationship of that bias to the R-squared of the model, at various levels of correlation between the X variables.</p>
<object type="image/svg+xml" data="https://freerangestats.info/img/0279-rsquared-bias.svg" width="100%"><img src="https://freerangestats.info/img/0279-rsquared-bias.png" width="100%" /></object>
<p>And here is the relationship of the bias to sample size, standard deviation of Y, and correlation between the X all in one chart:</p>
<object type="image/svg+xml" data="https://freerangestats.info/img/0279-corr-sd-bias.svg" width="100%"><img src="https://freerangestats.info/img/0279-corr-sd-bias.png" width="100%" /></object>
<p>Of the two visualisations I probably prefer that last one, the heat map. First, it dramatically shows (all that white) that the regression estimates from the full model aren’t biased at all. Second, it nicely shows that the bias in the estimates returned by stepwise regression is worse:</p>
<ul>
<li>for smaller samples</li>
<li>with higher correlation between the X variables</li>
<li>and with high variance of the y variable</li>
</ul>
<p>So, finally, here’s the code for those simulations:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="c1"># parameters to run this for:</span><span class="w">
</span><span class="n">var_params</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">expand_grid</span><span class="p">(</span><span class="n">xcm</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="o">:</span><span class="m">9</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="m">10</span><span class="p">,</span><span class="w"> </span><span class="n">ysdm</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="o">:</span><span class="m">9</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">200</span><span class="p">,</span><span class="w"> </span><span class="m">2000</span><span class="p">))</span><span class="w">
</span><span class="c1"># export onto the cluster some objects we need to use:</span><span class="w">
</span><span class="n">clusterExport</span><span class="p">(</span><span class="n">cluster</span><span class="p">,</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"sim_steps"</span><span class="p">,</span><span class="w"> </span><span class="s2">"var_params"</span><span class="p">))</span><span class="w">
</span><span class="c1"># run all the simulations</span><span class="w">
</span><span class="n">many_params</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">foreach</span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="o">:</span><span class="n">nrow</span><span class="p">(</span><span class="n">var_params</span><span class="p">),</span><span class="w"> </span><span class="n">.combine</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">rbind</span><span class="p">)</span><span class="w"> </span><span class="o">%dopar%</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">res</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">sim_steps</span><span class="p">(</span><span class="n">xcm</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">var_params</span><span class="p">[</span><span class="n">i</span><span class="p">,</span><span class="w"> </span><span class="p">]</span><span class="o">$</span><span class="n">xcm</span><span class="p">,</span><span class="w">
</span><span class="n">ysdm</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">var_params</span><span class="p">[</span><span class="n">i</span><span class="p">,</span><span class="w"> </span><span class="p">]</span><span class="o">$</span><span class="n">ysdm</span><span class="p">,</span><span class="w">
</span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">var_params</span><span class="p">[</span><span class="n">i</span><span class="p">,</span><span class="w"> </span><span class="p">]</span><span class="o">$</span><span class="n">n</span><span class="p">,</span><span class="w">
</span><span class="n">runs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">200</span><span class="p">)</span><span class="w">
</span><span class="n">res</span><span class="o">$</span><span class="n">biases_df</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="c1"># linechart plot:</span><span class="w">
</span><span class="n">many_params</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
</span><span class="n">mutate</span><span class="p">(</span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">glue</span><span class="p">(</span><span class="s2">"n = {n}"</span><span class="p">))</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
</span><span class="n">ggplot</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">r2</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">bias</span><span class="p">,</span><span class="w"> </span><span class="n">colour</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">model</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">facet_grid</span><span class="p">(</span><span class="n">n</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">as.ordered</span><span class="p">(</span><span class="n">xcm</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_smooth</span><span class="p">(</span><span class="n">se</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">,</span><span class="w"> </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"loess"</span><span class="p">,</span><span class="w"> </span><span class="n">formula</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">span</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.8</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_point</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">scale_x_continuous</span><span class="p">(</span><span class="n">breaks</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">0.2</span><span class="p">,</span><span class="w"> </span><span class="m">0.8</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">labs</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"R-squared (proportion of Y's variance explained by model)"</span><span class="p">,</span><span class="w">
</span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Average bias of estimated variable coefficients"</span><span class="p">,</span><span class="w">
</span><span class="n">title</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Bias in regression coefficients after stepwise selection of variables"</span><span class="p">,</span><span class="w">
</span><span class="n">subtitle</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Bias is worst with small samples, models with low R-squared, and correlation in the explanatory variables (shown from 0 to 0.9).
Model with 15 explanatory 'X' variables. Correct values of coefficients are 0, 0.1 or 1; so a bias of +1 is very serious."</span><span class="p">,</span><span class="w">
</span><span class="n">colour</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Model fitting procedure"</span><span class="p">)</span><span class="w">
</span><span class="c1"># heatmap plot:</span><span class="w">
</span><span class="n">many_params</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
</span><span class="n">mutate</span><span class="p">(</span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">glue</span><span class="p">(</span><span class="s2">"n = {n}"</span><span class="p">))</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
</span><span class="n">ggplot</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">xcm</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ysdm</span><span class="p">,</span><span class="w"> </span><span class="n">fill</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">bias</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">facet_grid</span><span class="p">(</span><span class="n">model</span><span class="o">~</span><span class="n">n</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_tile</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">scale_fill_gradientn</span><span class="p">(</span><span class="n">colours</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"white"</span><span class="p">,</span><span class="w"> </span><span class="s2">"steelblue"</span><span class="p">,</span><span class="w"> </span><span class="s2">"darkred"</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">labs</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Correlation between the X variables"</span><span class="p">,</span><span class="w">
</span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Standard deviation of the Y variable"</span><span class="p">,</span><span class="w">
</span><span class="n">fill</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Average bias of estimated variable coefficients:"</span><span class="p">,</span><span class="w">
</span><span class="n">title</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Bias in regression coefficients after stepwise selection of variables"</span><span class="p">,</span><span class="w">
</span><span class="n">subtitle</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Bias is worst with small samples, high variance response, and correlation in the explanatory variables.
Model with 15 explanatory 'X' variables. Correct values of coefficients are 0, 0.1 or 1; so a bias of +1 is very serious."</span><span class="p">)</span></code></pre></figure>
<p>That’s all folks. Just don’t use stepwise selection of variables. Don’t use automated addition and deletion of variables, and don’t take them out yourself by hand “because it’s not significant”. If it’s explanation you’re after, use theory-driven model selection; Bayesian methods are a good complement to that, and force you to think about the problem. For regression-based prediction, use a lasso or elastic net regularization.</p>
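<p>As a minimal sketch of that last suggestion, here is one way a lasso could be fitted with the <code>glmnet</code> package. This is an illustration only, not the simulation used for the charts above: the data, the seed, and the split of the 15 coefficients into five each of 1, 0.1 and 0 are all made up for the example. Note that the lasso deliberately shrinks coefficients towards zero, so it is a tool for prediction rather than for unbiased explanation.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">library(glmnet)

set.seed(123)

# Hypothetical data: 15 uncorrelated X variables, with assumed true
# coefficients of 1, 0.1 and 0 (five of each)
n &lt;- 200
p &lt;- 15
x &lt;- matrix(rnorm(n * p), nrow = n,
            dimnames = list(NULL, paste0("x", 1:p)))
true_coefs &lt;- c(rep(1, 5), rep(0.1, 5), rep(0, 5))
y &lt;- drop(x %*% true_coefs) + rnorm(n, sd = 3)

# alpha = 1 is the lasso; values between 0 and 1 give an elastic net.
# cv.glmnet chooses the penalty strength lambda by cross-validation.
cv_fit &lt;- cv.glmnet(x, y, alpha = 1)

# Coefficients at the "one standard error" lambda; variables that have
# been dropped show as "." in the printed sparse matrix
lasso_coefs &lt;- coef(cv_fit, s = "lambda.1se")
print(lasso_coefs)</code></pre></figure>
<p>Predictions for new data would then come from <code>predict(cv_fit, newx = ..., s = "lambda.1se")</code>, with the new X supplied as a matrix with the same columns.</p>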
</description>
<pubDate>Sat, 14 Sep 2024 00:00:00 +1100</pubDate>
<link>https://freerangestats.info/blog/2024/09/14/stepwise</link>
<guid isPermaLink="true">https://freerangestats.info/blog/2024/09/14/stepwise</guid>
</item>
<item>
<title>Gender and sexuality in Australian surveys and census by @ellis2013nz</title>
<description><h2 id="the-2021-abs-standard-for-sex-gender-variations-of-sex-characteristics-and-sexual-orientation-variables">The 2021 ABS Standard for ‘Sex, Gender, Variations of Sex Characteristics and Sexual Orientation Variables’</h2>
<p>Over the past two weeks there has been <a href="https://www.abc.net.au/news/2024-09-01/anthony-albanese-census-gender-identity-intersex/104296702">quite a controversy</a> relating to questions about sexuality and gender in the next Australian Census of Population and Housing, in 2026. Things hit the news when the testing of “new questions” was withdrawn, then there was an announcement that “the question” will be included; then apparent clarification that one of “the questions” will be but not all of them.</p>
<p>My work includes understanding risks for these sorts of controversies (in other countries, not Australia) so I spent a bit of time getting my head around it. The Australian Broadcasting Corporation reports that the <a href="https://www.abc.net.au/news/2024-09-06/2026-census-questions-revealed/104321662">three originally proposed questions were</a>:</p>
<ul>
<li>What is the person’s gender?</li>
<li>How does the person describe their sexual orientation?</li>
<li>Has the person been told they were born with a variation of sex characteristics?</li>
</ul>
<p>These are essentially three of <a href="https://www.abs.gov.au/statistics/standards/standard-sex-gender-variations-sex-characteristics-and-sexual-orientation-variables/latest-release">the standard questions on sex, gender, sexual orientation</a> in the relevant Australian Bureau of Statistics (ABS) Standard, which was released in early 2021 and has had some minor tweaks and corrections since. A fourth question, apparently not up for debate, is:</p>
<ul>
<li>What was [your/the person’s name/their] sex recorded at birth?</li>
</ul>
<p>The standard options for this are “Male”, “Female”, and “Another term (please specify)”.</p>
<p>At the time of writing, it looks like the sexual orientation and sex at birth questions <em>will</em> be included in the Census but not those on gender and variation of sex characteristics. I guess this reflects a perception that Australian society / politics has a definite widespread acceptance of variation in sexual orientation - it’s generally seen that people can sleep with and indeed marry whomever they choose. What hasn’t yet become as common apparently is acceptance of a modest degree of complexity in sexual characteristics and gender identity.</p>
<p>The Australian Standard is well worth a read in full for anyone who has to ask or analyse survey questions relating to sex and gender. It’s a good piece of work, was very thoroughly researched and consulted on. The concepts are explained clearly, most critically the difference between sex relating to physical characteristics and gender as a social and cultural concept.</p>
<p>The standard isn’t just a list of questions and the wording to use for the answer options (although it does give this, and the carefully defined acceptable variants in limited circumstances, in explicit detail of course). Some of the elements that may not be obvious to people new to this field include, for example, strict rules that the questions have to be asked exactly as written. Interviewers are forbidden from inferring either sex or gender from appearance, names, etc. Or that the preferred word to use for statistics relating to the whole population is “Persons”, not “Total” or “People”, when presenting total population counts for sex.</p>
<p>As per the Standard, the key variable in the “Cisgender and Trans and Gender Diverse Classification” has to be derived by comparing the answer to “What was this person’s sex recorded at birth” and “What is this person’s gender”.</p>
<p>There’s a decision table in the Standard on how to do this; it gets complicated when one or both of the questions had “Another Term” selected for sex at birth, or “Prefer not to answer” to gender now. In either of these cases the correct value for Cisgender and Trans and Gender Divsere variable becomes “Inadequately described”.</p>
<p>If sex at birth differs from gender now, or gender now is “non-binary” or “different term”, the person is classified as “trans”. If sex at birth matches gender now, the person is classified as “cis”. All this of course looks straightforward and simple when set out clearly. It makes sense, and flows naturally from the realisation that cis/trans status is quite different from gender now.</p>
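<p>As a rough sketch only, the derivation described above could look something like the following in R. The function name <code>derive_cis_trans</code> and the exact answer labels are my own placeholders rather than the Standard’s wording, missing answers are lumped into “Inadequately described” as a simplification, and the Standard’s full decision table covers more cases than this.</p>

```r
# Hypothetical helper: derive cis/trans status from the two survey answers.
# Answer labels are illustrative assumptions, not the Standard's code list.
derive_cis_trans <- function(sex_at_birth, gender_now) {
  sex_at_birth <- as.character(sex_at_birth)
  gender_now   <- as.character(gender_now)

  # The gender answer that "matches" each binary sex-at-birth answer
  matching_gender <- c("Male" = "Man or male", "Female" = "Woman or female")
  # NA whenever sex at birth is "Another term" or missing
  expected <- unname(matching_gender[sex_at_birth])

  # Start everyone as inadequately described, then classify the usable rows
  status <- rep("Inadequately described", length(sex_at_birth))
  usable <- !is.na(expected) & !is.na(gender_now) &
    gender_now != "Prefer not to answer"
  is_cis <- usable & gender_now == expected

  # Any other usable gender answer (opposite binary, non-binary,
  # different term) counts as trans and gender diverse
  status[usable & !is_cis] <- "Trans and gender diverse"
  status[is_cis] <- "Cis"
  status
}

# Made-up answers, one row per respondent
answers <- data.frame(
  sex_at_birth = c("Female", "Male", "Female", "Another term", "Male"),
  gender_now   = c("Woman or female", "Woman or female", "Non-binary",
                   "Man or male", "Prefer not to answer")
)
answers$cis_trans <- derive_cis_trans(answers$sex_at_birth, answers$gender_now)
print(answers)
```

<p>Note that gender itself is never overwritten; cis/trans status is a separate derived variable sitting alongside it.</p>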
<p>For example, giving people a choice of male, female, trans-male, trans-female in response to a question on gender is both offensive and leads to poor quality data. This is precisely because the whole point of “transitioning” for many people is that now you are just whatever the final gender is, not an amalgam of previous and current genders - “trans-women <em>are women</em>”, and so on. So the correct statistical procedure is to ask separate questions on sex at birth and gender now, and combine the two to determine cis / trans status.</p>
<h2 id="whats-the-information-for-and-what-we-know">What’s the information for, and what we know</h2>
<p>The outcry from the LGBTIQ+ community in response to some of these questions being reported as scrapped from the next Census had, it seemed to me, two substantive points:</p>
<ol>
<li>More detailed information on these issues is needed for government policy and program decision-making, particularly in health and social policy.</li>
<li>There is a “right to be counted” and recognised for who people are, and the Census is an important part of the state validating this aspect of people’s existence.</li>
</ol>
<blockquote>
<p>“Without comprehensive and inclusive Australian census data, the full diversity of LGBTQ+ communities remain invisible and marginalised. This unnecessary lack of data hinders proper coordination of public health measures and social policy. We need political leadership.”</p>
</blockquote>
<p><em><a href="https://aus.social/@Lukas/113035241786325182">Lukas Parker on Mastodon</a></em></p>
<p>I think the second of these two arguments - about the rights and symbolism - is actually the stronger.</p>
<p>Important background on why this is such a sore point is that in the 2021 Census, there was a question on sex that was designed before the 2021 Standard came fully into force (Census questionnaires take years to develop). The very small proportion of people who picked “non-binary” in response to the question on sex was <a href="https://www.abc.net.au/news/2022-10-07/latham-one-nation-transgender-abs-census-population-problematic/101507074">unhelpfully picked up by a politician as evidence that there were hardly any transgender people in Australia</a>. This is clearly an invalid interpretation of a question about sex (not about gender, and certainly not about cis/trans gender status), but one that is unfortunately likely to be repeated in the absence of clearly more appropriate estimates.</p>
<p>Note that, as described above, the Standard now would require “Another term (please specify)” as the valid third choice for a question on sex at birth. But this still gets one nowhere in terms of counting the number of trans people; a second question on “gender now” is required.</p>
<p>For health or social policy planning, any question in the Census needs to be justified in terms of the extra benefits gained from making <em>everyone</em> in the country (not just a survey sample) fill it in. This might be justified, for example, because a very fine level of spatial detail is needed that could never be provided by surveys; or a definitive population total is needed for survey design purposes; or we need to know this information on all individuals for later analysis with integrated data.</p>
<p>I have in fact seen the argument that these population totals are needed for survey calibration and design made in this recent debate but unfortunately have lost the link. But I don’t think it’s particularly compelling, and have yet to see a good technical argument to the contrary. I believe that survey data could give us good enough estimates of the totals for decision-making, down to number of persons per state, and understanding of characteristics, and creating purposive samples for any given relationship to population proportions.</p>
<p>This is certainly the case for sexual orientation, where we can go (for example) to the latest (2020) <a href="https://www.abs.gov.au/statistics/people/people-and-communities/general-social-survey-summary-results-australia/2020#data-downloads">General Social Survey Table 5</a> and see that 773,000 (out of 20.3m) persons aged 15 or over are estimated to be Gay, Lesbian, Bisexual or ‘Other’.</p>
<p>The relative margin of error for that estimate is 9.0%, so perhaps a confidence interval would be between around 700 and 850 thousand persons - between 3.5% and 4.2% of the total persons aged 15 and over. 9% isn’t great, but I’m confident the measurement error is a bigger risk than sampling error here. That is, some people either don’t understand, don’t know, or don’t want to tell an interviewer (or a government form online) their sexual orientation. A census would remove the sampling error but doesn’t help measurement error.</p>
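<p>The back-of-envelope arithmetic behind that interval is below, using only the three figures quoted above. It assumes the 9% can be read as the half-width of the interval relative to the estimate, and that a symmetric interval is good enough for this purpose.</p>

```r
estimate   <- 773000   # LGB+ persons aged 15+, GSS 2020 figure quoted above
population <- 20.3e6   # all persons aged 15+, as quoted above
rel_moe    <- 0.09     # relative margin of error quoted above

# Interval for the count: estimate plus or minus 9% of itself
ci_count <- estimate * (1 + c(-1, 1) * rel_moe)

# The same interval as a share of persons aged 15 and over
ci_prop <- ci_count / population

round(ci_count / 1000)    # thousands of persons: 703 843
round(100 * ci_prop, 1)   # per cent: 3.5 4.2
```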
<p>Relating to the number of trans persons in Australia, an estimate could be derived from the <a href="https://www.abs.gov.au/methodologies/national-health-survey-methodology/2022#overview">2022 National Health Survey</a>, which included all four necessary questions in the sex, gender, etc Standard mentioned above. Surprisingly given the public interest in the issue, I haven’t been able to find such an estimate published anywhere. From eyeballing and back-of-enveloping the size of confidence intervals in one of the reports based on it, the <a href="https://www.abs.gov.au/statistics/health/mental-health/national-study-mental-health-and-wellbeing/2020-2022">National Study of Mental Health and Wellbeing</a>, I infer that trans people were between 1% and 4% of the <em>sample</em> of the National Health Survey. However, as young people were deliberately oversampled in the survey and we definitely think age is correlated to trans status, this is surely an overestimate of the population compared to what you could do with the original microdata.</p>
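<p>To illustrate why oversampling matters, here is a toy calculation. Every number in it is invented purely to show the mechanics and none comes from the National Health Survey; a real analysis would use the person weights supplied with the microdata rather than two crude age groups.</p>

```r
# All figures below are hypothetical, chosen only to show the mechanics
toy <- data.frame(
  age_group = c("Younger", "Older"),
  pop_share = c(0.20, 0.80),  # each group's share of the population
  n_sampled = c(500, 500),    # younger group deliberately oversampled
  n_trans   = c(20, 5)        # respondents classified as trans
)

# Naive proportion of the sample, ignoring the sample design
unweighted <- sum(toy$n_trans) / sum(toy$n_sampled)

# Weight each group's rate by its population share instead
weighted <- sum(toy$pop_share * toy$n_trans / toy$n_sampled)

c(unweighted = unweighted, weighted = weighted)   # 0.025 versus 0.016
```

<p>In this made-up case the raw sample proportion overstates the population proportion, which is the direction of bias I would expect from eyeballing the published sample.</p>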
<p>Someone should do this! Or if it exists, let me know in the comments. One hour’s work (for someone with access to the microdata), compared to the very large cost of including a question in the census. Alternatively, if the data isn’t good enough for estimating the number of trans persons in Australia, we should have a discussion of why and what to do about it. Many of the candidate reasons why - to do with measurement or validity error of some kind - would also mean that using the Census for this purpose wouldn’t help.</p>
<p>Anyway, my blog is about playing around with data and charts and not politics, so let’s go to a visualisation of some of the published results from the 2020 General Social Survey. This survey included the question on sexual orientation and a good range of results are reported from it. After a lot of iteration I came up with this chart:</p>
<object type="image/svg+xml" data="https://freerangestats.info/img/0278-lollipops.svg" width="100%"><img src="https://freerangestats.info/img/0278-lollipops.png" width="100%" /></object>
<p>You may need to zoom in on this, but it should be readable on most screens. In this chart I have selected the survey questions that had the largest estimated ratio of difference between heterosexuals and gay, lesbian or bisexual choosing that answer, having first filtered out any questions with less than 5 percentage points of absolute difference. There’s a lot packed in but there are some interesting things here:</p>
<ul>
<li>Why are more young people reporting as gay, lesbian or bisexual? Obviously there are several candidate explanations: people become straighter as they get older (unlikely); more people are gay, lesbian or bisexual than before; or more people are <em>admitting to themselves and interviewers</em> that they are gay, lesbian or bisexual than before. Most likely the answer is some version of this latter point. Cultural norms have changed astonishingly fast, in historical terms, in this space so it’s not surprising that there is a generational difference in reported prevalence.</li>
<li>Gay, lesbian and bisexual people are apparently less likely to be unemployed and more likely to be in the highest income quintile.</li>
<li>Gay, lesbian and bisexual people are much more likely to be not married or in a de facto marriage than heterosexual people.</li>
<li>Sadly but unsurprisingly, gay, lesbian and bisexual people are much more likely to have experienced discrimination or to disagree with the statement that the police and justice system can be trusted.</li>
</ul>
<p>Note that I am very deliberately using the expression “LGB+” here in these charts, not the familiar “LGBTIQ+” expression. LGBTIQ of course does not aspire to be a statistical classification and is more like a social or political coalition; it combines sexual orientation (LGB), sexual characteristics (the “I” for Intersex) and trans/cis gender status (the “T” for Trans and possibly the “Q” for Queer, although that could really refer to some combination of the three concepts).</p>
<p>Here’s the code that downloads the data, tidies it up and presents that chart:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="n">tidyverse</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">readxl</span><span class="p">)</span><span class="w">
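</span><span class="n">library</span><span class="p">(</span><span class="n">glue</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">scales</span><span class="p">)</span><span class="w"> </span><span class="c1"># provides percent() used for the axis labels</span><span class="w">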
</span><span class="n">download.file</span><span class="p">(</span><span class="s2">"https://www.abs.gov.au/statistics/people/people-and-communities/general-social-survey-summary-results-australia/2020/GSS_Table5.xlsx"</span><span class="p">,</span><span class="w">
</span><span class="n">destfile</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"gss_table5.xlsx"</span><span class="p">,</span><span class="w"> </span><span class="n">mode</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"wb"</span><span class="p">)</span><span class="w">
</span><span class="n">demog_cats</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="w">
</span><span class="s2">"Sex"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Whether currently smokes"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Age group"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Employed"</span><span class="p">,</span><span class="w">
</span><span class="s2">"Level of highest non-school qualification"</span><span class="p">,</span><span class="w">
</span><span class="s2">"Engagement in employment or study"</span><span class="p">,</span><span class="w">
</span><span class="s2">"Family composition of household"</span><span class="p">,</span><span class="w">
</span><span class="s2">"Marital status"</span><span class="p">,</span><span class="w">
</span><span class="s2">"Main Source of Household Income"</span><span class="p">,</span><span class="w">
</span><span class="s2">"Current weekly household equivalised gross income quintiles"</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="n">d</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">read_excel</span><span class="p">(</span><span class="s2">"gss_table5.xlsx"</span><span class="p">,</span><span class="w"> </span><span class="n">skip</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">6</span><span class="p">,</span><span class="w"> </span><span class="n">sheet</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Table 5.1_Estimate"</span><span class="p">)</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
</span><span class="n">rename</span><span class="p">(</span><span class="n">variable</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">...1</span><span class="p">)</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
</span><span class="n">filter</span><span class="p">(</span><span class="o">!</span><span class="nf">is.na</span><span class="p">(</span><span class="n">variable</span><span class="p">))</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
</span><span class="n">mutate</span><span class="p">(</span><span class="n">category</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ifelse</span><span class="p">(</span><span class="nf">is.na</span><span class="p">(</span><span class="n">Heterosexual</span><span class="p">),</span><span class="w"> </span><span class="n">variable</span><span class="p">,</span><span class="w"> </span><span class="kc">NA</span><span class="p">))</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
</span><span class="n">tidyr</span><span class="o">::</span><span class="n">fill</span><span class="p">(</span><span class="n">category</span><span class="p">,</span><span class="w"> </span><span class="n">.direction</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"down"</span><span class="p">)</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
</span><span class="n">filter</span><span class="p">(</span><span class="o">!</span><span class="nf">is.na</span><span class="p">(</span><span class="n">Heterosexual</span><span class="p">))</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
</span><span class="n">mutate</span><span class="p">(</span><span class="n">sequence</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="o">:</span><span class="n">n</span><span class="p">())</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
</span><span class="n">gather</span><span class="p">(</span><span class="n">sexuality</span><span class="p">,</span><span class="w"> </span><span class="n">value</span><span class="p">,</span><span class="w"> </span><span class="o">-</span><span class="n">variable</span><span class="p">,</span><span class="w"> </span><span class="o">-</span><span class="n">category</span><span class="p">,</span><span class="w"> </span><span class="o">-</span><span class="n">sequence</span><span class="p">)</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
</span><span class="n">mutate</span><span class="p">(</span><span class="n">value</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">as.numeric</span><span class="p">(</span><span class="n">value</span><span class="p">))</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
</span><span class="n">group_by</span><span class="p">(</span><span class="n">sexuality</span><span class="p">,</span><span class="w"> </span><span class="n">category</span><span class="p">)</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
</span><span class="n">mutate</span><span class="p">(</span><span class="n">prop</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">value</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="nf">sum</span><span class="p">(</span><span class="n">value</span><span class="p">))</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
</span><span class="n">mutate</span><span class="p">(</span><span class="n">variable</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fct_reorder</span><span class="p">(</span><span class="n">variable</span><span class="p">,</span><span class="w"> </span><span class="n">sequence</span><span class="p">),</span><span class="w">
</span><span class="n">var_wrap</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fct_reorder</span><span class="p">(</span><span class="n">str_wrap</span><span class="p">(</span><span class="n">variable</span><span class="p">,</span><span class="w"> </span><span class="m">20</span><span class="p">),</span><span class="w"> </span><span class="n">sequence</span><span class="p">))</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
</span><span class="n">ungroup</span><span class="p">()</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
</span><span class="n">mutate</span><span class="p">(</span><span class="n">cat_type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ifelse</span><span class="p">(</span><span class="n">category</span><span class="w"> </span><span class="o">%in%</span><span class="w"> </span><span class="n">demog_cats</span><span class="p">,</span><span class="w">
</span><span class="s2">"Characteristics"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Experiences and attitudes"</span><span class="p">))</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
</span><span class="n">mutate</span><span class="p">(</span><span class="n">sexuality</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ifelse</span><span class="p">(</span><span class="n">sexuality</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"Gay, Lesbian or Bisexual"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Gay, Lesbian, Bisexual or Other"</span><span class="p">,</span><span class="w"> </span><span class="n">sexuality</span><span class="p">))</span><span class="w">
</span><span class="c1"># check no typos in the demography categories</span><span class="w">
</span><span class="n">stopifnot</span><span class="p">(</span><span class="nf">all</span><span class="p">(</span><span class="n">demog_cats</span><span class="w"> </span><span class="o">%in%</span><span class="w"> </span><span class="n">d</span><span class="o">$</span><span class="n">category</span><span class="p">))</span><span class="w">
</span><span class="c1"># some categories have long answers and are difficult to present on a chart</span><span class="w">
</span><span class="n">difficult_cats</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"Community involvement"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Cultural tolerance and discrimination"</span><span class="p">,</span><span class="w">
</span><span class="s2">"Family and community support"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Crime and safety"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Stressors"</span><span class="p">)</span><span class="w">
</span><span class="c1"># draw basic faceted barchart</span><span class="w">
</span><span class="n">d</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
</span><span class="n">filter</span><span class="p">(</span><span class="o">!</span><span class="n">category</span><span class="w"> </span><span class="o">%in%</span><span class="w"> </span><span class="n">difficult_cats</span><span class="p">)</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
</span><span class="n">filter</span><span class="p">(</span><span class="n">sexuality</span><span class="w"> </span><span class="o">!=</span><span class="w"> </span><span class="s2">"Total persons"</span><span class="p">)</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
</span><span class="n">filter</span><span class="p">(</span><span class="n">variable</span><span class="w"> </span><span class="o">!=</span><span class="w"> </span><span class="s2">"Persons aged 15 years and over"</span><span class="p">)</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
</span><span class="n">ggplot</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">var_wrap</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">prop</span><span class="p">,</span><span class="w"> </span><span class="n">fill</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">sexuality</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_col</span><span class="p">(</span><span class="n">position</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"dodge"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">facet_wrap</span><span class="p">(</span><span class="o">~</span><span class="n">str_wrap</span><span class="p">(</span><span class="n">category</span><span class="p">,</span><span class="w"> </span><span class="m">35</span><span class="p">),</span><span class="w"> </span><span class="n">scales</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"free"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">labs</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">""</span><span class="p">,</span><span class="w"> </span><span class="n">fill</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">""</span><span class="p">,</span><span class="w">
</span><span class="n">title</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Comparison of LGB+ and heterosexual attitudes"</span><span class="p">,</span><span class="w">
</span><span class="n">subtitle</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Selected questions from Australia's General Social Survey, 2020"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">scale_y_continuous</span><span class="p">(</span><span class="n">labels</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">percent</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">theme</span><span class="p">(</span><span class="n">axis.text.x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_text</span><span class="p">(</span><span class="n">angle</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">45</span><span class="p">,</span><span class="w"> </span><span class="n">hjust</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">),</span><span class="w">
</span><span class="n">legend.position</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">0.8</span><span class="p">,</span><span class="w"> </span><span class="m">0.1</span><span class="p">))</span><span class="w">
</span><span class="n">sc</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="m">1.1</span><span class="w">
</span><span class="c1"># better visualisation that focuses on results.</span><span class="w">
</span><span class="c1"># number of questions to pick to highlight:</span><span class="w">
</span><span class="n">k</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="m">22</span><span class="w">
</span><span class="n">d2</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">d</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
</span><span class="n">select</span><span class="p">(</span><span class="n">category</span><span class="p">,</span><span class="w"> </span><span class="n">variable</span><span class="p">,</span><span class="w"> </span><span class="n">sexuality</span><span class="p">,</span><span class="w"> </span><span class="n">prop</span><span class="p">,</span><span class="w"> </span><span class="n">cat_type</span><span class="p">)</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
</span><span class="n">spread</span><span class="p">(</span><span class="n">sexuality</span><span class="p">,</span><span class="w"> </span><span class="n">prop</span><span class="p">)</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
</span><span class="c1"># rabs expresses the ratio as &gt;= 1 in either direction, so "half as likely" ranks the same as "twice as likely":</span><span class="w">
</span><span class="n">mutate</span><span class="p">(</span><span class="n">ratio</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">`Gay, Lesbian, Bisexual or Other`</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="n">Heterosexual</span><span class="p">,</span><span class="w">
</span><span class="n">rabs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">pmax</span><span class="p">(</span><span class="n">ratio</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="n">ratio</span><span class="p">),</span><span class="w">
</span><span class="n">diff</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">`Gay, Lesbian, Bisexual or Other`</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">Heterosexual</span><span class="p">)</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
</span><span class="n">mutate</span><span class="p">(</span><span class="n">label</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">case_when</span><span class="p">(</span><span class="w">
</span><span class="n">category</span><span class="w"> </span><span class="o">%in%</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"Age group"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Marital status"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Employed"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Engagement in employment or study"</span><span class="p">)</span><span class="w"> </span><span class="o">~</span><span class="w">
</span><span class="nf">as.character</span><span class="p">(</span><span class="n">variable</span><span class="p">),</span><span class="w">
</span><span class="kc">TRUE</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">glue</span><span class="p">(</span><span class="s2">"{category}:\n'{variable}'"</span><span class="p">)))</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
</span><span class="n">mutate</span><span class="p">(</span><span class="n">lab_seq</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">case_when</span><span class="p">(</span><span class="w">
</span><span class="n">grepl</span><span class="p">(</span><span class="s2">"employ"</span><span class="p">,</span><span class="w"> </span><span class="n">label</span><span class="p">,</span><span class="w"> </span><span class="n">ignore.case</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="m">-150</span><span class="p">,</span><span class="w">
</span><span class="n">grepl</span><span class="p">(</span><span class="s2">"income"</span><span class="p">,</span><span class="w"> </span><span class="n">label</span><span class="p">,</span><span class="w"> </span><span class="n">ignore.case</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="m">-200</span><span class="p">,</span><span class="w">
</span><span class="n">cat_type</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"Characteristics"</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="o">-</span><span class="nf">as.numeric</span><span class="p">(</span><span class="n">as.factor</span><span class="p">(</span><span class="n">label</span><span class="p">)),</span><span class="w">
</span><span class="kc">TRUE</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">diff</span><span class="p">))</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
</span><span class="n">mutate</span><span class="p">(</span><span class="n">label2</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fct_reorder</span><span class="p">(</span><span class="n">label</span><span class="p">,</span><span class="w"> </span><span class="n">lab_seq</span><span class="p">))</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
</span><span class="n">arrange</span><span class="p">(</span><span class="n">desc</span><span class="p">(</span><span class="n">rabs</span><span class="p">))</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
</span><span class="c1"># don't show any where the absolute difference is less than 5 percentage points:</span><span class="w">
</span><span class="n">filter</span><span class="p">(</span><span class="nf">abs</span><span class="p">(</span><span class="n">diff</span><span class="p">)</span><span class="w"> </span><span class="o">&gt;</span><span class="w"> </span><span class="m">0.05</span><span class="p">)</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
</span><span class="n">slice</span><span class="p">(</span><span class="m">1</span><span class="o">:</span><span class="n">k</span><span class="p">)</span><span class="w">
</span><span class="n">d3</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">d2</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
</span><span class="n">select</span><span class="p">(</span><span class="n">label2</span><span class="p">,</span><span class="w"> </span><span class="n">glb</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">`Gay, Lesbian, Bisexual or Other`</span><span class="p">,</span><span class="w">
</span><span class="n">hs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Heterosexual</span><span class="p">,</span><span class="w"> </span><span class="n">ratio</span><span class="p">,</span><span class="w"> </span><span class="n">cat_type</span><span class="p">)</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
</span><span class="n">mutate</span><span class="p">(</span><span class="n">glb2</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ifelse</span><span class="p">(</span><span class="n">glb</span><span class="w"> </span><span class="o">&gt;</span><span class="w"> </span><span class="n">hs</span><span class="p">,</span><span class="w"> </span><span class="n">glb</span><span class="w"> </span><span class="m">-0.026</span><span class="p">,</span><span class="w"> </span><span class="n">glb</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">0.026</span><span class="p">))</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
</span><span class="c1"># variable for using to draw colour of segments and arrows:</span><span class="w">
</span><span class="n">mutate</span><span class="p">(</span><span class="n">variable</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ifelse</span><span class="p">(</span><span class="n">ratio</span><span class="w"> </span><span class="o">&gt;</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="s2">"Gay, Lesbian, Bisexual or Other"</span><span class="p">,</span><span class="w">
</span><span class="s2">"Heterosexual"</span><span class="p">))</span><span class="w">
</span><span class="c1"># draw lollipop / arrow plot</span><span class="w">
</span><span class="n">d2</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
</span><span class="n">select</span><span class="p">(</span><span class="n">label2</span><span class="p">,</span><span class="w"> </span><span class="n">`Gay, Lesbian, Bisexual or Other`</span><span class="p">,</span><span class="w"> </span><span class="n">Heterosexual</span><span class="p">,</span><span class="w"> </span><span class="n">cat_type</span><span class="p">)</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
</span><span class="n">gather</span><span class="p">(</span><span class="n">variable</span><span class="p">,</span><span class="w"> </span><span class="n">value</span><span class="p">,</span><span class="w"> </span><span class="o">-</span><span class="n">label2</span><span class="p">,</span><span class="w"> </span><span class="o">-</span><span class="n">cat_type</span><span class="p">)</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
</span><span class="n">ggplot</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">value</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">label2</span><span class="p">,</span><span class="w"> </span><span class="n">colour</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">variable</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">facet_wrap</span><span class="p">(</span><span class="o">~</span><span class="n">cat_type</span><span class="p">,</span><span class="w"> </span><span class="n">scales</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"free_y"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_segment</span><span class="p">(</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">d3</span><span class="p">,</span><span class="w"> </span><span class="n">linewidth</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1.1</span><span class="p">,</span><span class="w">
</span><span class="n">aes</span><span class="p">(</span><span class="n">yend</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">label2</span><span class="p">,</span><span class="w"> </span><span class="n">xend</span><span class="w"> </span><span class="o">=</span><span class="n">glb2</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">hs</span><span class="p">,</span><span class="w">
</span><span class="n">colour</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">stage</span><span class="p">(</span><span class="n">start</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">variable</span><span class="p">,</span><span class="w">
</span><span class="n">after_scale</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">prismatic</span><span class="o">::</span><span class="n">clr_lighten</span><span class="p">(</span><span class="n">colour</span><span class="p">,</span><span class="w"> </span><span class="n">space</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"combined"</span><span class="p">))),</span><span class="w">
</span><span class="n">arrow</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">arrow</span><span class="p">(</span><span class="n">angle</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">15</span><span class="p">,</span><span class="w"> </span><span class="n">length</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">unit</span><span class="p">(</span><span class="m">0.15</span><span class="p">,</span><span class="w"> </span><span class="s2">"inches"</span><span class="p">)))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_point</span><span class="p">(</span><span class="n">size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">4</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">scale_x_continuous</span><span class="p">(</span><span class="n">label</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">percent</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">labs</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Prevalence of response"</span><span class="p">,</span><span class="w">
</span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">""</span><span class="p">,</span><span class="w">
</span><span class="n">colour</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">""</span><span class="p">,</span><span class="w">
</span><span class="n">title</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Key differences between LGB+ and heterosexual Australians"</span><span class="p">,</span><span class="w">
</span><span class="n">subtitle</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Responses from the General Social Survey 2020"</span><span class="p">)</span></code></pre></figure>
<p>In case anyone’s interested, I did over 100 iterations of this chart, looking for something that really presents some analytical findings in an interpretable and aesthetically pleasing way. It’s much harder to do presentation charts than the straightforward ones that just show all the numbers at once. One interesting point is that the great majority of the work is in reshaping and tidying data - in this case, different (but obviously derivative) data frames for drawing the points and for drawing the arrow segments. It’s a nice example of how data management and plot polishing become an iterative process. You’d never publish the data in the format needed for a plot like this; it has to be up to the end analyst to identify this as the plot they want, and reshape the data accordingly.</p>
<p>There was also a lot of effort going into decisions like “Do I need the ‘category’ (e.g. Age group) for each label on the charts, or are some self-explanatory, and how do I code this in without retyping all the data?”.</p>
<p>That code also produced a more straightforward chart which I show here for comparison. It’s the same info (although question categories with long names are excluded for readability), but puts a lot more work on the reader to pick out the key messages. The final plot I used tries to help the reader with this by doing a bunch of analysis ourselves first and presenting in a way that highlights the most important aspects of it.</p>
<object type="image/svg+xml" data="https://freerangestats.info/img/0278-many-facets.svg" width="100%"><img src="https://freerangestats.info/img/0278-many-facets.png" width="100%" /></object>
<p>That’s all for today. Do read (and follow, in your next survey questionnaire with questions on sex or gender) that standard on sex, gender, etc. - it’s good stuff.</p>
<h3 id="late-edits--update">Late edits / update</h3>
<ul>
<li>thanks to @baptnz@social.nz who pointed out on Mastodon a better method of making the arrows appear light rather than the transparency hack I originally used. I have edited the code above to use his method, which relies on the ‘prismatic’ package.</li>
<li>immediately after I wrote all the above I discovered that, while I was publishing the original post, the Government announced that the questions on sexual orientation, sex at birth, and gender will all be included in the Census, but not the question on sexual characteristics. So this is a good outcome for those wanting to estimate gender issues and trans/cis status, although those with an interest in sexual characteristics are still disappointed.</li>
</ul>
</description>
<pubDate>Sun, 08 Sep 2024 00:00:00 +1100</pubDate>
<link>https://freerangestats.info/blog/2024/09/08/sex-gender</link>
<guid isPermaLink="true">https://freerangestats.info/blog/2024/09/08/sex-gender</guid>
</item>
<item>
<title>Sampling without replacement with unequal probabilities by @ellis2013nz</title>
<description><h2 id="not-proportional-to-w">Not proportional to w</h2>
<p>A week ago I was surprised to read on <a href="https://notstatschat.rbind.io/2024/08/26/another-way-to-not-sample-with-replacement/">Thomas Lumley’s Biased and Inefficient</a> blog that when using R’s <code class="language-plaintext highlighter-rouge">sample()</code> function without replacement and with unequal probabilities of individual units being sampled:</p>
<blockquote>
<p>“What R currently has is sequential sampling: if you give it a set of priorities w it will sample an element with probability proportional to w from the population, remove it from the population, then sample with probability proportional to w from the remaining elements, and so on. This is useful, but a lot of people don’t realise that the probability of element i being sampled is not proportional to w_i”</p>
</blockquote>
<p>This surprised me profoundly - in the way that Bertrand Russell once wrote of surprise in some epistemological context of getting to the bottom of the stairs and finding there was one more stair to go than you unconsciously expected. That is, I hadn’t even been consciously aware of assuming that the probability of an element being sampled is proportional to the weight given to it, but if I <em>had</em> been asked if this was the case I would have given the wrong answer (‘yes’).</p>
<p>So I didn’t know this, but I should have.</p>
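<p>A tiny worked example makes the quote concrete. This is only a sketch, assuming the sequential scheme works exactly as described in the quote: a hypothetical population of three units with weights 1/6, 2/6 and 3/6, from which two are drawn without replacement. A unit ends up in the sample either by being drawn first, or by being drawn second after one of the other units went first, so its exact inclusion probability can be added up directly (the names <code class="language-plaintext highlighter-rouge">w</code> and <code class="language-plaintext highlighter-rouge">incl</code> are just for illustration):</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"># Exact inclusion probabilities for sequential sampling of 2 units
# without replacement from a hypothetical toy population of 3 units
w &lt;- c(1, 2, 3) / 6

incl &lt;- sapply(seq_along(w), function(i) {
  others &lt;- setdiff(seq_along(w), i)
  # drawn first, or drawn second after some other unit j was drawn first
  w[i] + sum(w[others] * w[i] / (1 - w[others]))
})

round(incl, 4)      # inclusion probability of each unit
round(incl / 2, 4)  # share of all selections, to compare with w</code></pre></figure>
<p>This gives inclusion probabilities of 5/12, 11/15 and 17/20, or shares of all selections of about 0.208, 0.367 and 0.425, against weights of 0.167, 0.333 and 0.5. So even in this toy case the lowest-weight unit turns up more often than its weight suggests and the highest-weight unit less often.</p>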
<p>Thomas had some examples in his blog relating to streams of data but I wanted to see for myself. So I wrote a function to simulate sampling from a small, finite population, with equal or unequal weights. Mostly, I sampled 10 units at a time from a population of 20, and gave them weights from about 0.005 to 0.095 (a big range). The weights are forced to add up to 1 so they can be easily compared with that element’s proportion of the eventual sample of samples. I then take 100,000 of these samples of 10 (the default in the code below), and look at the proportion of the total collection that is each element.</p>
<p>In the charts that follow, the brown diagonal line would be where the eventual proportion of element i being sampled is proportional to its weight. First, here is the main situation of concern - unequal weights, sampling without replacement, sample size is quite a big proportion of the population (10 out of 20):</p>
<object type="image/svg+xml" data="https://freerangestats.info/img/0276-20-10-no-replace.svg" width="100%"><img src="https://freerangestats.info/img/0276-20-10-no-replace.png" width="100%" /></object>
<p>OK! so we see the elements with low weights get sampled a bit more than naively expected, and the elements with high weights a bit less.</p>
<p>Just to check I’m not dreaming, here’s the same simulation but this time we are sampling <em>with</em> replacement. Now everything works as might intuitively be expected: the eventual proportion in our sample of samples matches the original weights, up to simulation noise:</p>
<object type="image/svg+xml" data="https://freerangestats.info/img/0276-20-10-replace.svg" width="100%"><img src="https://freerangestats.info/img/0276-20-10-replace.png" width="100%" /></object>
<p>Here’s the code for that function and those two runs of it.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="n">tidyverse</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">glue</span><span class="p">)</span><span class="w">
</span><span class="n">compare_ppswor</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">10</span><span class="p">,</span><span class="w">
</span><span class="n">N</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">20</span><span class="p">,</span><span class="w">
</span><span class="n">replace</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">,</span><span class="w">
</span><span class="n">reps</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1e5</span><span class="p">,</span><span class="w">
</span><span class="n">prob</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="o">:</span><span class="n">N</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="nf">sum</span><span class="p">(</span><span class="m">1</span><span class="o">:</span><span class="n">N</span><span class="p">),</span><span class="w">
</span><span class="n">FUN</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">sample</span><span class="p">){</span><span class="w">
</span><span class="n">x</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">paste0</span><span class="p">(</span><span class="s2">"unit"</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="o">:</span><span class="n">N</span><span class="p">)</span><span class="w">
</span><span class="n">x</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">factor</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">levels</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">x</span><span class="p">)</span><span class="w">
</span><span class="n">prob</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">prob</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="nf">sum</span><span class="p">(</span><span class="n">prob</span><span class="p">)</span><span class="w">
</span><span class="n">samples</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">lapply</span><span class="p">(</span><span class="nf">rep</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">reps</span><span class="p">),</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">size</span><span class="p">){</span><span class="w">
</span><span class="n">FUN</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">size</span><span class="p">,</span><span class="w"> </span><span class="n">prob</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">prob</span><span class="p">,</span><span class="w"> </span><span class="n">replace</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">replace</span><span class="p">)</span><span class="w">
</span><span class="p">})</span><span class="w">
</span><span class="n">s</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">ifelse</span><span class="p">(</span><span class="n">replace</span><span class="p">,</span><span class="w"> </span><span class="s1">'with replacement'</span><span class="p">,</span><span class="w"> </span><span class="s1">'without replacement'</span><span class="p">)</span><span class="w">
</span><span class="n">m</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">case_when</span><span class="p">(</span><span class="n">identical</span><span class="p">(</span><span class="n">FUN</span><span class="p">,</span><span class="w"> </span><span class="n">sample</span><span class="p">)</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="s2">"R's native `sample()` function."</span><span class="p">,</span><span class="w">
</span><span class="n">identical</span><span class="p">(</span><span class="n">FUN</span><span class="p">,</span><span class="w"> </span><span class="n">sample_unequal</span><span class="p">)</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="s2">"experimental function based on Brewer (1975)."</span><span class="p">,</span><span class="w">
</span><span class="n">identical</span><span class="p">(</span><span class="n">FUN</span><span class="p">,</span><span class="w"> </span><span class="n">sample_brewer</span><span class="p">)</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="s2">"Brewer (1975) as implemented by Tillé/Matei."</span><span class="p">)</span><span class="w">
</span><span class="n">p</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">tibble</span><span class="p">(</span><span class="w">
</span><span class="n">original</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">prob</span><span class="p">,</span><span class="w">
</span><span class="n">selected</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">as.numeric</span><span class="p">(</span><span class="n">table</span><span class="p">(</span><span class="n">unlist</span><span class="p">(</span><span class="n">samples</span><span class="p">)))</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="p">(</span><span class="n">reps</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">n</span><span class="p">)</span><span class="w">
</span><span class="p">)</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
</span><span class="n">ggplot</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">original</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">selected</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_abline</span><span class="p">(</span><span class="n">slope</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="p">,</span><span class="w"> </span><span class="n">intercept</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">colour</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"orange"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_point</span><span class="p">(</span><span class="n">colour</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"steelblue"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">labs</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Original 'probability' or weight"</span><span class="p">,</span><span class="w">
</span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Actual proportion of selections"</span><span class="p">,</span><span class="w">
</span><span class="n">subtitle</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">glue</span><span class="p">(</span><span class="s2">"Population of {N}, sample size {n}, sampling {s}.\nUsing {m}"</span><span class="p">),</span><span class="w">
</span><span class="n">title</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Use of `sample()` with unequal probabilities of sampling"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">coord_equal</span><span class="p">()</span><span class="w">
</span><span class="nf">return</span><span class="p">(</span><span class="n">p</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">compare_ppswor</span><span class="p">()</span><span class="w">
</span><span class="n">compare_ppswor</span><span class="p">(</span><span class="n">replace</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span></code></pre></figure>
<p>When the sample is a smaller proportion of a population, the no-replacement discrepancy is less, as seen in these examples with population sizes of 50 and 250.</p>
<object type="image/svg+xml" data="https://freerangestats.info/img/0276-50-10-no-replace.svg" width="100%"><img src="https://freerangestats.info/img/0276-50-10-no-replace.png" width="100%" /></object>
<object type="image/svg+xml" data="https://freerangestats.info/img/0276-250-10-no-replace.svg" width="100%"><img src="https://freerangestats.info/img/0276-250-10-no-replace.png" width="100%" /></object>
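<p>The same pattern can be checked numerically without drawing a chart. The sketch below is a stripped-down, base-R-only variant of the simulation above (the helper <code class="language-plaintext highlighter-rouge">max_gap()</code> and its arguments are hypothetical, and it uses fewer repetitions so it runs quickly). It reports the largest absolute gap between any unit’s share of selections and its weight, for samples of 10 from populations of 20, 50 and 250:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">set.seed(123)

# largest gap between a unit's share of all selections and its weight
max_gap &lt;- function(N, n = 10, reps = 5000) {
  prob &lt;- 1:N / sum(1:N)
  # each column of `draws` is one sample of n unit indices, without replacement
  draws &lt;- replicate(reps, sample(N, size = n, prob = prob))
  share &lt;- tabulate(as.vector(draws), nbins = N) / (reps * n)
  max(abs(share - prob))
}

gaps &lt;- sapply(c(20, 50, 250), max_gap)
gaps</code></pre></figure>
<p>The gap for the population of 20 should come out clearly larger than the one for 250; the exact numbers depend on the seed and on simulation noise.</p>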
<p>At the extreme, if the sample size is the same size as the population (20 each in the simulation below), then of course, all samples without replacement are identical to the population, so it is impossible for the end proportion to be anything other than 1/N each:</p>
<object type="image/svg+xml" data="https://freerangestats.info/img/0276-20-20-no-replace.svg" width="100%"><img src="https://freerangestats.info/img/0276-20-20-no-replace.png" width="100%" /></object>
<h2 id="brewers-1975-algorithm">Brewer’s 1975 algorithm</h2>
<p>Thomas was writing because there are plans to change this behaviour of <code class="language-plaintext highlighter-rouge">sample()</code>, or at least add an option to change it, so instead the eventual proportion of samples containing an element will be proportional to the weight even when sampling with unequal probabilities and without replacement. The method that is going to be implemented in R (written in C of course before being hooked into R - it’s an inherently iterative thing that will only be efficient in compiled code) is based on a method published way back in 1975 by Ken Brewer.</p>
<p>An aside - I had the privilege and pleasure to be taught sampling by Professor Brewer back in the 1990s at the Australian National University - at least it was a pleasure at the time (sheer entertainment value as well as the pleasure of witnessing someone who was a virtuoso in the subject matter, master of the theory of sampling), and I now belatedly realise what a privilege it was. I do think he struggled occasionally with some of the students though; one of my vivid memories of my whole education at ANU was of him in confusion when one of the less mathematically inclined asked him to explain what an equation taking up a fair proportion of the blackboard “means” - “Means? means? it means what it says!”</p>
<p>I don’t have easy access to the 1975 paper but I found this very interesting <a href="https://stats.stackexchange.com/questions/110178/brewers-method-for-sampling-with-unequal-probabilities-with-n2">question relating to it on Cross-Validated</a>. From that discussion and reading the part of the SAS manual referred to, it sounds like SAS implements the algorithm only for when the sample in a given stratum is 2, but Brewer had generalised his method to larger samples. The answer given by the ever-reliable StasK seems pretty clear on how this is done, so presuming he got it right I had a go at implementing this method in R, with code below in the form of two functions <code class="language-plaintext highlighter-rouge">P()</code> and <code class="language-plaintext highlighter-rouge">sample_unequal()</code>:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="cd">#' @param p probabilities of remaining units</span><span class="w">
</span><span class="cd">#' @param n total sample size</span><span class="w">
</span><span class="cd">#' @param k which draw in the sequence this is (1 for the first unit sampled, up to n)</span><span class="w">
</span><span class="n">P</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">p</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">k</span><span class="p">){</span><span class="w">
</span><span class="n">r</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">n</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">k</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">1</span><span class="w">
</span><span class="n">D</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="nf">sum</span><span class="p">((</span><span class="n">p</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="p">(</span><span class="m">1</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">p</span><span class="p">))</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="p">(</span><span class="m">1</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">r</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">p</span><span class="p">))</span><span class="w">
</span><span class="n">new_p</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">p</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="p">(</span><span class="m">1</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">p</span><span class="p">)</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="p">(</span><span class="n">D</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="p">(</span><span class="m">1</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">r</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">p</span><span class="p">))</span><span class="w">
</span><span class="nf">return</span><span class="p">(</span><span class="n">new_p</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">sample_unequal</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">size</span><span class="p">,</span><span class="w"> </span><span class="n">prob</span><span class="p">,</span><span class="w"> </span><span class="n">replace</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">,</span><span class="w"> </span><span class="n">keep</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">){</span><span class="w">
</span><span class="k">if</span><span class="p">(</span><span class="n">size</span><span class="w"> </span><span class="o">&gt;</span><span class="w"> </span><span class="nf">length</span><span class="p">(</span><span class="n">x</span><span class="p">)){</span><span class="w">
</span><span class="n">stop</span><span class="p">(</span><span class="s2">"Sample size cannot be larger than the population of units"</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="k">if</span><span class="p">(</span><span class="n">replace</span><span class="p">){</span><span class="w">
</span><span class="n">stop</span><span class="p">(</span><span class="s2">"Only sampling without replacement implemented at this point"</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="k">if</span><span class="p">(</span><span class="n">size</span><span class="w"> </span><span class="o">&gt;</span><span class="w"> </span><span class="p">(</span><span class="m">1</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="nf">max</span><span class="p">(</span><span class="n">prob</span><span class="p">))){</span><span class="w">
</span><span class="n">stop</span><span class="p">(</span><span class="n">glue</span><span class="p">(</span><span class="s2">"Sample size ({size}) cannot be larger than 1 / max(prob) (which is {1 / max(prob)})"</span><span class="p">))</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">d</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">tibble</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">prob</span><span class="p">)</span><span class="w">
</span><span class="n">the_sample</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">x</span><span class="p">[</span><span class="kc">NULL</span><span class="p">]</span><span class="w">
</span><span class="n">remnants</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">x</span><span class="w">
</span><span class="n">remnant_p</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">prob</span><span class="w">
</span><span class="k">for</span><span class="p">(</span><span class="n">k</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="m">1</span><span class="o">:</span><span class="n">size</span><span class="p">){</span><span class="w">
</span><span class="n">new_p</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">P</span><span class="p">(</span><span class="n">remnant_p</span><span class="p">,</span><span class="w"> </span><span class="n">size</span><span class="p">,</span><span class="w"> </span><span class="n">k</span><span class="p">)</span><span class="w">
</span><span class="k">if</span><span class="p">(</span><span class="n">keep</span><span class="p">){</span><span class="w">
</span><span class="n">d2</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">tibble</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">remnants</span><span class="p">,</span><span class="w"> </span><span class="n">prob</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">new_p</span><span class="p">)</span><span class="w">
</span><span class="nf">names</span><span class="p">(</span><span class="n">d2</span><span class="p">)[</span><span class="m">2</span><span class="p">]</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">paste0</span><span class="p">(</span><span class="s2">"k"</span><span class="p">,</span><span class="w"> </span><span class="n">k</span><span class="p">)</span><span class="w">
</span><span class="n">d</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">d</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
</span><span class="n">left_join</span><span class="p">(</span><span class="n">d2</span><span class="p">,</span><span class="w"> </span><span class="n">by</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"x"</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="k">if</span><span class="p">(</span><span class="nf">min</span><span class="p">(</span><span class="n">new_p</span><span class="p">)</span><span class="w"> </span><span class="o">&lt;</span><span class="w"> </span><span class="m">0</span><span class="p">){</span><span class="w">
</span><span class="n">warning</span><span class="p">(</span><span class="s2">"Some negative probabilities returned"</span><span class="p">)</span><span class="w">
</span><span class="n">new_p</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">pmax</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">new_p</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">latest_sample</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">sample</span><span class="p">(</span><span class="n">remnants</span><span class="p">,</span><span class="w"> </span><span class="n">size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">prob</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">new_p</span><span class="p">)</span><span class="w">
</span><span class="n">which_chosen</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">which</span><span class="p">(</span><span class="n">remnants</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="n">latest_sample</span><span class="p">)</span><span class="w">
</span><span class="n">remnants</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">remnants</span><span class="p">[</span><span class="o">-</span><span class="n">which_chosen</span><span class="p">]</span><span class="w">
</span><span class="n">remnant_p</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">remnant_p</span><span class="p">[</span><span class="o">-</span><span class="n">which_chosen</span><span class="p">]</span><span class="w">
</span><span class="n">the_sample</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="n">the_sample</span><span class="p">,</span><span class="w"> </span><span class="n">latest_sample</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="k">if</span><span class="p">(</span><span class="n">keep</span><span class="p">){</span><span class="w">
</span><span class="nf">return</span><span class="p">(</span><span class="nf">list</span><span class="p">(</span><span class="n">the_sample</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">the_sample</span><span class="p">,</span><span class="w"> </span><span class="n">d</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">d</span><span class="p">))</span><span class="w">
</span><span class="p">}</span><span class="w"> </span><span class="k">else</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nf">return</span><span class="p">(</span><span class="n">the_sample</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w"> </span></code></pre></figure>
<p>I’m not guaranteeing I got it right - feedback is welcome.</p>
<p>Of course, I designed this to be plugged into the function I made earlier for comparing eventual sample proportions to the weights given, so with <code class="language-plaintext highlighter-rouge">compare_ppswor(FUN = sample_unequal, reps = 10000)</code> I can get this comparison:</p>
<object type="image/svg+xml" data="https://freerangestats.info/img/0276-20-10-no-replace-brewer.svg" width="100%"><img src="https://freerangestats.info/img/0276-20-10-no-replace-brewer.png" width="100%" /></object>
<p>Interestingly, it’s still not exactly ‘right’ in the sense that the eventual proportions in the sample aren’t exactly those given in the form of weights. But it’s much better at doing this than the current <code class="language-plaintext highlighter-rouge">sample()</code> method is.</p>
<p>And if we have a sample that is, say, 25% of the population instead of 50%, with <code class="language-plaintext highlighter-rouge">compare_ppswor(n = 5, FUN = sample_unequal, reps = 10000)</code>, then we get results where the proportions and weights are pretty much indistinguishable:</p>
<object type="image/svg+xml" data="https://freerangestats.info/img/0276-20-5-no-replace-brewer.svg" width="100%"><img src="https://freerangestats.info/img/0276-20-5-no-replace-brewer.png" width="100%" /></object>
<p>OK, that’s all. I played around with this a bit more, particularly trying to understand the limitations of the algorithm, which breaks down as the sample size gets bigger relative to the population, but none of it was really interesting enough to write up. A full script with a bit more in it is available in the <a href="https://github.com/ellisp/blog-source/blob/master/_working/0276-sampling-unequal-p.R">source code for this blog</a>.</p>
<p>I guess the meta-lesson is to watch out for things you’re assuming without noticing you are.</p>
<h2 id="late-addition">Late addition!</h2>
<p>When I posted about this on Mastodon, Thomas Lumley drew my attention to the <code class="language-plaintext highlighter-rouge">sampling</code> library by Yves Tillé and Alina Matei, which contains multiple algorithms for sampling with unequal probabilities without replacement. I hadn’t even stopped to look because really this was about me learning, but yes, of course there’s an R package for that. So I wrapped their <code class="language-plaintext highlighter-rouge">UPbrewer</code> function in a little function to work with my comparison. Unlike me they had read Brewer’s actual article, and their implementation of course works perfectly:</p>
<object type="image/svg+xml" data="https://freerangestats.info/img/0276-20-10-brewer-better.svg" width="100%"><img src="https://freerangestats.info/img/0276-20-10-brewer-better.png" width="100%" /></object>
<p>So the version I hacked together based on a Cross-Validated answer clearly doesn’t do justice to the method proposed by Brewer (or, more likely, it was some other method he proposed separately).</p>
<p>For the record here’s the little convenience function to test the Tillé/Matei implementation:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="c1">#--------------using sampling library------------------</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">sampling</span><span class="p">)</span><span class="w">
</span><span class="n">sample_brewer</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">size</span><span class="p">,</span><span class="w"> </span><span class="n">prob</span><span class="p">,</span><span class="w"> </span><span class="n">replace</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">,</span><span class="w"> </span><span class="n">keep</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">){</span><span class="w">
</span><span class="n">pik</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">prob</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="nf">sum</span><span class="p">(</span><span class="n">prob</span><span class="p">)</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">size</span><span class="w">
</span><span class="n">s</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">UPbrewer</span><span class="p">(</span><span class="n">pik</span><span class="p">)</span><span class="w">
</span><span class="n">the_sample</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">x</span><span class="p">[</span><span class="n">which</span><span class="p">(</span><span class="n">s</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="m">1</span><span class="p">)]</span><span class="w">
</span><span class="nf">return</span><span class="p">(</span><span class="n">the_sample</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">compare_ppswor</span><span class="p">(</span><span class="n">FUN</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">sample_brewer</span><span class="p">,</span><span class="w"> </span><span class="n">reps</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">10000</span><span class="p">)</span></code></pre></figure>
<p>It’s <em>much</em> faster too!</p>
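<p>One way to read the <code class="language-plaintext highlighter-rouge">pik</code> line in that wrapper: as far as I understand it, <code class="language-plaintext highlighter-rouge">UPbrewer</code> wants target inclusion probabilities rather than raw weights, so the weights are rescaled to sum to the sample size. The sketch below uses made-up weights (the vector <code class="language-plaintext highlighter-rouge">prob</code> and the sample sizes are purely illustrative) to show why a sample that is too large relative to the biggest weight can’t work. When the weights sum to 1, this is the same condition that <code class="language-plaintext highlighter-rouge">sample_unequal()</code> stops on:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"># Hypothetical weights for a population of 5 units, for illustration only
prob &lt;- c(0.4, 0.3, 0.15, 0.1, 0.05)

# Scale the weights to target inclusion probabilities for a sample of 2
pik2 &lt;- prob / sum(prob) * 2
sum(pik2)   # adds up to the sample size
max(pik2)   # about 0.8, a valid probability

# For a sample of 3, the largest unit would need an inclusion probability above 1
pik3 &lt;- prob / sum(prob) * 3
max(pik3)   # about 1.2, not a valid probability

# Equivalent check on the raw weights, as used in sample_unequal()
3 &gt; 1 / max(prob)</code></pre></figure>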
</description>
<pubDate>Sat, 31 Aug 2024 00:00:00 +1100</pubDate>
<link>https://freerangestats.info/blog/2024/08/31/ppswor</link>
<guid isPermaLink="true">https://freerangestats.info/blog/2024/08/31/ppswor</guid>
</item>
<item>
<title>Ratios of indexed line charts by @ellis2013nz</title>
<description><p>A few months ago Michael Read of the Australian Financial Review published a chart showing real household income in Australia plummeting, relative to the OECD, since late 2021. A version of this <a href="https://mastodon.social/@andyjennings@aus.social/113020422083231612">drifted across my social media feed</a> and interested me.</p>
<p>First I wanted to re-create the chart, which was straightforward enough as it used data published by the OECD, accessible in SDMX format from their dot Stat tool. After a bit of experimenting I decided to also look at household expenditure and GDP per capita, because I wanted to understand whether this was just an income issue or a broader one translating into expenditure and GDP. That gives me this chart; the middle panel is identical in substance to Read’s original:</p>
<object type="image/svg+xml" data="https://freerangestats.info/img/0275-inc-exp-gdp-2007.svg" width="100%"><img src="https://freerangestats.info/img/0275-inc-exp-gdp-2007.png" width="100%" /></object>
<p>There are some interesting things here!</p>
<ul>
<li>Yes, the story is bad for Australian household income since late 2021, as the original chart showed.</li>
<li>It’s not just an income story - expenditure and GDP have both declined as well, although not as dramatically (implying that some of the lost income is translating into lower savings / wealth).</li>
<li>The Covid spike upwards in income is matched with a downwards spike in expenditure and GDP.</li>
<li>Putting aside recent declines, all three series (household income, household expenditure, GDP per capita) show for both Australia and the OECD average that people are better off than in 2007, when the index is set to 100.</li>
</ul>
<p>It’s worth noting before we go on that “real gross disposable income” means income that has been adjusted for inflation and is measured after subtracting taxes, social security contributions, change in net equity in pension funds, and interest on financial liabilities (including, I believe, mortgages). So an increase in interest rates that hits households with mortgages is one thing that would show up here.</p>
<p>Here’s the R code that produced that first chart:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="n">rsdmx</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">tidyverse</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">janitor</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">lubridate</span><span class="p">)</span><span class="w">
</span><span class="n">metadata</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">tribble</span><span class="p">(</span><span class="o">~</span><span class="n">measure</span><span class="p">,</span><span class="w"> </span><span class="o">~</span><span class="n">full_measure</span><span class="p">,</span><span class="w">
</span><span class="s2">"B6GS1M_R_POP"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Real gross disposable income per capita of households and NPISH"</span><span class="p">,</span><span class="w">
</span><span class="s2">"P3S1M_R_POP"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Real final consumption expenditure per capita of households and NPISH"</span><span class="p">,</span><span class="w">
</span><span class="s2">"B1GQ_R_POP"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Real gross domestic product per capita"</span><span class="p">)</span><span class="w">
</span><span class="n">the_caption</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Source: OECD dot Stat P3S1M_R_POP, B6GS1M_R_POP, B1GQ_R_POP. 'NPISH' means 'non-profit institutions serving households'."</span><span class="w">
</span><span class="n">d</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">readSDMX</span><span class="p">(</span><span class="s2">"https://sdmx.oecd.org/public/rest/data/OECD.SDD.NAD,DSD_HHDASH@DF_HHDASH_INDIC,1.0/Q.OECD+AUS.B1GQ_R_POP+B6GS1M_R_POP+P3S1M_R_POP.?dimensionAtObservation=AllDimensions"</span><span class="p">)</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
</span><span class="n">as_tibble</span><span class="p">()</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
</span><span class="n">clean_names</span><span class="p">()</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
</span><span class="n">mutate</span><span class="p">(</span><span class="n">tp</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">yq</span><span class="p">(</span><span class="n">time_period</span><span class="p">)</span><span class="w"> </span><span class="p">)</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
</span><span class="n">left_join</span><span class="p">(</span><span class="n">metadata</span><span class="p">,</span><span class="w"> </span><span class="n">by</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"measure"</span><span class="p">)</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
</span><span class="n">mutate</span><span class="p">(</span><span class="n">ref_area</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">case_when</span><span class="p">(</span><span class="w">
</span><span class="n">ref_area</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"AUS"</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="s2">"Australia"</span><span class="p">,</span><span class="w">
</span><span class="n">ref_area</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"OECD"</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="s2">"OECD average"</span><span class="w">
</span><span class="p">))</span><span class="w">
</span><span class="n">palette</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"Australia"</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"red"</span><span class="p">,</span><span class="w"> </span><span class="n">`OECD average`</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"blue"</span><span class="p">)</span><span class="w">
</span><span class="n">d</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
</span><span class="n">ggplot</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">tp</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">obs_value</span><span class="p">,</span><span class="w"> </span><span class="n">colour</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ref_area</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">facet_wrap</span><span class="p">(</span><span class="o">~</span><span class="n">str_wrap</span><span class="p">(</span><span class="n">full_measure</span><span class="p">,</span><span class="w"> </span><span class="m">40</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_line</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">scale_colour_manual</span><span class="p">(</span><span class="n">values</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">palette</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">theme</span><span class="p">(</span><span class="n">legend.position</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">0.1</span><span class="p">,</span><span class="w"> </span><span class="m">0.7</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">labs</span><span class="p">(</span><span class="n">colour</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">""</span><span class="p">,</span><span class="w">
</span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Index (2007 = 100)"</span><span class="p">,</span><span class="w">
</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">""</span><span class="p">,</span><span class="w">
</span><span class="n">title</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Growth in household expenditure and income, Australia and OECD average"</span><span class="p">,</span><span class="w">
</span><span class="n">subtitle</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Indexes set to 100 in 2007. Comparison shows relative growth rates, not absolute difference."</span><span class="p">,</span><span class="w">
</span><span class="n">caption</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">the_caption</span><span class="p">)</span></code></pre></figure>
<p>Now, I’ve long been interested in the impact of the choice of year for indexing series like this. I have argued before <a href="/blog/2016/08/18/dualaxes">on this blog</a> that many of the criticisms of dual-axis time series plots can be mitigated by treating the lines on the page as indexed time series and choosing scales accordingly; but related to this is the risk that choosing a different index year for your series can make a significant difference to the visual impression of the chart.</p>
<p>In the case of the income chart, I was struck by how Australia’s real household income grew faster than the OECD average for some years from 2007 and then, from around 2014, the OECD started to catch up. But to see this you have to pay attention to the gap between the lines, whereas the crossing of the lines in 2022 is visually dramatic and appears to really mean something. But it doesn’t! Or at least, not as much as the naive viewer might think. What the crossing of the lines means is just that the OECD average growth since 2007 has caught up with the Australian growth since 2007; it doesn’t mean the OECD average has caught up in absolute terms.</p>
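<p>A minimal sketch with made-up numbers may help make this concrete (the <code class="language-plaintext highlighter-rouge">toy</code> data and its values are invented for illustration; they are not OECD data). Two series are indexed to 100 in period 1, and the indexed lines meet in period 4 even though one level is still double the other:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">library(tidyverse)

# Made-up levels: "a" starts high and grows slowly, "b" starts low and grows faster
toy &lt;- tibble(
  period = rep(1:5, times = 2),
  area   = rep(c("a", "b"), each = 5),
  level  = c(200, 210, 215, 218, 220,
             100, 102, 106, 109, 112)
)

# Index each series to 100 in its own first period
toy_indexed &lt;- toy |&gt;
  group_by(area) |&gt;
  mutate(index = level / level[period == 1] * 100) |&gt;
  ungroup()

# In period 4 both indexes are 109, so the lines touch,
# but the underlying levels are 218 and 109
toy_indexed |&gt;
  filter(period == 4)</code></pre></figure>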
<p>If we’re not particularly interested in 2007 as a starting point, we can choose some other period. In the next chart I’ve set it to be 2014, which seems to me to be the time that OECD average growth sped up and Australian growth slowed down. And I think this chart really helps to show that the poor relative performance of Australia on these indicators isn’t just the dramatic decline in the last couple of years, but a longer-lasting phenomenon:</p>
<object type="image/svg+xml" data="https://freerangestats.info/img/0275-inc-exp-gdp-2014.svg" width="100%"><img src="https://freerangestats.info/img/0275-inc-exp-gdp-2014.png" width="100%" /></object>
<p>Interesting, huh? That chart was drawn with this:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">d</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
</span><span class="n">group_by</span><span class="p">(</span><span class="n">ref_area</span><span class="p">,</span><span class="w"> </span><span class="n">measure</span><span class="p">)</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
</span><span class="n">mutate</span><span class="p">(</span><span class="n">obs_value</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">obs_value</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="n">obs_value</span><span class="p">[</span><span class="n">time_period</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"2014-Q1"</span><span class="p">]</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="m">100</span><span class="p">)</span><span class="w"> </span><span class="o">|&gt;</span><span class="w">
</span><span class="n">ggplot</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">tp</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">obs_value</span><span class="p">,</span><span class="w"> </span><span class="n">colour</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ref_area</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">facet_wrap</span><span class="p">(</span><span class="o">~</span><span class="n">str_wrap</span><span class="p">(</span><span class="n">full_measure</span><span class="p">,</span><span class="w"> </span><span class="m">40</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_line</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">scale_colour_manual</span><span class="p">(</span><span class="n">values</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">palette</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">theme</span><span class="p">(</span><span class="n">legend.position</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">0.1</span><span class="p">,</span><span class="w"> </span><span class="m">0.7</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">labs</span><span class="p">(</span><span class="n">colour</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">""</span><span class="p">,</span><span class="w">
</span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Index (2014 = 100)"</span><span class="p">,</span><span class="w">
</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">""</span><span class="p">,</span><span class="w">
</span><span class="n">title</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Growth in household expenditure and income, Australia and OECD average"</span><span class="p">,</span><span class="w">
</span><span class="n">subtitle</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Indexes set to 100 in first quarter of 2014. Comparison shows relative growth rates, not absolute difference."</span><span class="p">,</span><span class="w">
</span><span class="n">caption</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">the_caption</span><span class="p">)</span></code></pre></figure>
<p>Finally, I wanted to help the viewer make the comparison I was doing for myself in my head. I realised that one way to do this is to turn the two lines into a single line that is the ratio between them. That gives me this chart. I like this; I think the analytical step of dividing one of the indexed indicators by the other really is adding value here, if the intent is to compare growth in Australian indicators relative to the OECD’s. Now the turning point around 2014 is much more obvious, and the shape of the line is robust to the choice of index year, which is a very desirable property:</p>