<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<title>Spatial Regression — Geographic Data Science with Python</title>
<link rel="stylesheet" href="../_static/css/index.f658d18f9b420779cfdf24aa0a7e2d77.css">
<link rel="stylesheet"
href="../_static/vendor/fontawesome/5.13.0/css/all.min.css">
<link rel="preload" as="font" type="font/woff2" crossorigin
href="../_static/vendor/fontawesome/5.13.0/webfonts/fa-solid-900.woff2">
<link rel="preload" as="font" type="font/woff2" crossorigin
href="../_static/vendor/fontawesome/5.13.0/webfonts/fa-brands-400.woff2">
<link rel="stylesheet"
href="../_static/vendor/open-sans_all/1.44.1/index.css">
<link rel="stylesheet"
href="../_static/vendor/lato_latin-ext/1.44.1/index.css">
<link rel="stylesheet" href="../_static/pygments.css" type="text/css" />
<link rel="stylesheet" href="../_static/sphinx-book-theme.e7340bb3dbd8dde6db86f25597f54a1b.css" type="text/css" />
<link rel="stylesheet" type="text/css" href="../_static/togglebutton.css" />
<link rel="stylesheet" type="text/css" href="../_static/copybutton.css" />
<link rel="stylesheet" type="text/css" href="../_static/mystnb.css" />
<link rel="stylesheet" type="text/css" href="../_static/sphinx-thebe.css" />
<link rel="stylesheet" type="text/css" href="../_static/custom.css" />
<link rel="stylesheet" type="text/css" href="../_static/panels-main.c949a650a448cc0ae9fd3441c0e17fb0.css" />
<link rel="stylesheet" type="text/css" href="../_static/panels-variables.06eb56fa6e07937060861dad626602ad.css" />
<link rel="preload" as="script" href="../_static/js/index.d3f166471bb80abb5163.js">
<script id="documentation_options" data-url_root="../" src="../_static/documentation_options.js"></script>
<script src="../_static/jquery.js"></script>
<script src="../_static/underscore.js"></script>
<script src="../_static/doctools.js"></script>
<script src="../_static/togglebutton.js"></script>
<script src="../_static/clipboard.min.js"></script>
<script src="../_static/copybutton.js"></script>
<script >var togglebuttonSelector = '.toggle, .admonition.dropdown, .tag_hide_input div.cell_input, .tag_hide-input div.cell_input, .tag_hide_output div.cell_output, .tag_hide-output div.cell_output, .tag_hide_cell.cell, .tag_hide-cell.cell';</script>
<script src="../_static/sphinx-book-theme.7d483ff0a819d6edff12ce0b1ead3928.js"></script>
<script async="async" src="https://unpkg.com/thebelab@latest/lib/index.js"></script>
<script >
const thebe_selector = ".thebe"
const thebe_selector_input = "pre"
const thebe_selector_output = ".output"
</script>
<script async="async" src="../_static/sphinx-thebe.js"></script>
<script async="async" src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.7/latest.js?config=TeX-AMS-MML_HTMLorMML"></script>
<script type="text/x-mathjax-config">MathJax.Hub.Config({"tex2jax": {"inlineMath": [["\\(", "\\)"]], "displayMath": [["\\[", "\\]"]], "processRefs": false, "processEnvironments": false}})</script>
<link rel="canonical" href="https://geographicdata.science/book/notebooks/11_regression.html" />
<link rel="shortcut icon" href="../_static/favicon.ico"/>
<link rel="index" title="Index" href="../genindex.html" />
<link rel="search" title="Search" href="../search.html" />
<link rel="next" title="Spatial Feature Engineering" href="12_feature_engineering.html" />
<link rel="prev" title="Clustering & Regionalization" href="10_clustering_and_regionalization.html" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<meta name="docsearch:language" content="en" />
<!-- Opengraph tags -->
<meta property="og:url" content="https://geographicdata.science/book/notebooks/11_regression.html" />
<meta property="og:type" content="article" />
<meta property="og:title" content="Spatial Regression" />
<meta property="og:description" content="Spatial Regression Introduction What is spatial regression and why should I care? Regression (and prediction more generally) provides us a perfect case to ex" />
<meta property="og:image" content="https://geographicdata.science/book/_static/logo.png" />
<meta name="twitter:card" content="summary" />
</head>
<body data-spy="scroll" data-target="#bd-toc-nav" data-offset="80">
<div class="container-xl">
<div class="row">
<div class="col-12 col-md-3 bd-sidebar site-navigation show" id="site-navigation">
<div class="navbar-brand-box">
<a class="navbar-brand text-wrap" href="../index.html">
<img src="../_static/logo.png" class="logo" alt="logo">
<h1 class="site-logo" id="site-title">Geographic Data Science with Python</h1>
</a>
</div><form class="bd-search d-flex align-items-center" action="../search.html" method="get">
<i class="icon fas fa-search"></i>
<input type="search" class="form-control" name="q" id="search-input" placeholder="Search this book..." aria-label="Search this book..." autocomplete="off" >
</form>
<nav class="bd-links" id="bd-docs-nav" aria-label="Main navigation">
<ul class="nav sidenav_l1">
<li class="toctree-l1">
<a class="reference internal" href="../intro.html">
Home
</a>
</li>
</ul>
<p class="caption collapsible-parent">
<span class="caption-text">
Preface
</span>
</p>
<ul class="nav sidenav_l1">
<li class="toctree-l1">
<a class="reference internal" href="00_toc.html">
Table of Contents
</a>
</li>
<li class="toctree-l1">
<a class="reference internal" href="references.html">
References
</a>
</li>
</ul>
<p class="caption collapsible-parent">
<span class="caption-text">
Part I - Building Blocks
</span>
</p>
<ul class="nav sidenav_l1">
<li class="toctree-l1">
<a class="reference internal" href="../intro_part_i.html">
Overview
</a>
</li>
<li class="toctree-l1">
<a class="reference internal" href="01_geospatial_computational_environment.html">
Geospatial Computational Environment
</a>
</li>
<li class="toctree-l1">
<a class="reference internal" href="02_geo_thinking.html">
Geographic thinking for data scientists
</a>
</li>
<li class="toctree-l1">
<a class="reference internal" href="03_spatial_data.html">
Spatial Data
</a>
</li>
<li class="toctree-l1">
<a class="reference internal" href="04_spatial_weights.html">
Spatial Weights
</a>
</li>
</ul>
<p class="caption collapsible-parent">
<span class="caption-text">
Part II - Spatial Data Analysis
</span>
</p>
<ul class="nav sidenav_l1">
<li class="toctree-l1">
<a class="reference internal" href="../intro_part_ii.html">
Overview
</a>
</li>
<li class="toctree-l1">
<a class="reference internal" href="05_choropleth.html">
Choropleth Mapping
</a>
</li>
<li class="toctree-l1">
<a class="reference internal" href="06_spatial_autocorrelation.html">
Global Spatial Autocorrelation
</a>
</li>
<li class="toctree-l1">
<a class="reference internal" href="07_local_autocorrelation.html">
Local Spatial Autocorrelation
</a>
</li>
<li class="toctree-l1">
<a class="reference internal" href="08_point_pattern_analysis.html">
Point Pattern Analysis
</a>
</li>
</ul>
<p class="caption collapsible-parent">
<span class="caption-text">
Part III - Advanced Topics
</span>
</p>
<ul class="current nav sidenav_l1">
<li class="toctree-l1">
<a class="reference internal" href="../intro_part_ii.html">
Overview
</a>
</li>
<li class="toctree-l1">
<a class="reference internal" href="09_spatial_inequality.html">
Spatial Inequality
</a>
</li>
<li class="toctree-l1">
<a class="reference internal" href="10_clustering_and_regionalization.html">
Clustering & Regionalization
</a>
</li>
<li class="toctree-l1 current active">
<a class="current reference internal" href="#">
Spatial Regression
</a>
</li>
<li class="toctree-l1">
<a class="reference internal" href="12_feature_engineering.html">
Spatial Feature Engineering
</a>
</li>
</ul>
<p class="caption collapsible-parent">
<span class="caption-text">
Datasets
</span>
</p>
<ul class="nav sidenav_l1">
<li class="toctree-l1">
<a class="reference internal" href="../data/README.html">
Overview
</a>
</li>
<li class="toctree-l1">
<a class="reference internal" href="../data/airbnb/regression_cleaning.html">
AirBnb
</a>
</li>
<li class="toctree-l1">
<a class="reference internal" href="../data/airports/airports_cleaning.html">
Airports
</a>
</li>
<li class="toctree-l1">
<a class="reference internal" href="../data/brexit/brexit_cleaning.html">
Brexit
</a>
</li>
<li class="toctree-l1">
<a class="reference internal" href="../data/countries/countries_cleaning.html">
Countries
</a>
</li>
<li class="toctree-l1">
<a class="reference internal" href="../data/h3_grid/build_sd_h3_grid.html">
H3 Grid
</a>
</li>
<li class="toctree-l1">
<a class="reference internal" href="../data/mexico/README.html">
Mexico
</a>
</li>
<li class="toctree-l1">
<a class="reference internal" href="../data/nasadem/build_nasadem_sd.html">
NASA DEM
</a>
</li>
<li class="toctree-l1">
<a class="reference internal" href="../data/sandiego/sandiego_tracts_cleaning.html">
San Diego Tracts
</a>
</li>
<li class="toctree-l1">
<a class="reference internal" href="../data/texas/README.html">
Texas
</a>
</li>
<li class="toctree-l1">
<a class="reference internal" href="../data/tokyo/tokyo_cleaning.html">
Tokyo Photographs
</a>
</li>
<li class="toctree-l1">
<a class="reference internal" href="../data/us_county_income/README.html">
US County Income 1969-2017
</a>
</li>
</ul>
</nav> <!-- To handle the deprecated key -->
<div class="navbar_extra_footer">
Powered by <a href="https://jupyterbook.org">Jupyter Book</a>
</div>
</div>
<main class="col py-md-3 pl-md-4 bd-content overflow-auto" role="main">
<div class="topbar container-xl fixed-top">
<div class="topbar-contents row">
<div class="col-12 col-md-3 bd-topbar-whitespace site-navigation show"></div>
<div class="col pl-md-4 topbar-main">
<button id="navbar-toggler" class="navbar-toggler ml-0" type="button" data-toggle="collapse"
data-toggle="tooltip" data-placement="bottom" data-target=".site-navigation" aria-controls="navbar-menu"
aria-expanded="true" aria-label="Toggle navigation" aria-controls="site-navigation"
title="Toggle navigation" data-toggle="tooltip" data-placement="left">
<i class="fas fa-bars"></i>
<i class="fas fa-arrow-left"></i>
<i class="fas fa-arrow-up"></i>
</button>
<div class="dropdown-buttons-trigger">
<button id="dropdown-buttons-trigger" class="btn btn-secondary topbarbtn" aria-label="Download this page"><i
class="fas fa-download"></i></button>
<div class="dropdown-buttons">
<!-- ipynb file if we had a myst markdown file -->
<!-- Download raw file -->
<a class="dropdown-buttons" href="../_sources/notebooks/11_regression.ipynb"><button type="button"
class="btn btn-secondary topbarbtn" title="Download source file" data-toggle="tooltip"
data-placement="left">.ipynb</button></a>
<!-- Download PDF via print -->
<button type="button" id="download-print" class="btn btn-secondary topbarbtn" title="Print to PDF"
onClick="window.print()" data-toggle="tooltip" data-placement="left">.pdf</button>
</div>
</div>
<!-- Source interaction buttons -->
<!-- Full screen (wrap in <a> to have style consistency -->
<a class="full-screen-button"><button type="button" class="btn btn-secondary topbarbtn" data-toggle="tooltip"
data-placement="bottom" onclick="toggleFullScreen()" aria-label="Fullscreen mode"
title="Fullscreen mode"><i
class="fas fa-expand"></i></button></a>
<!-- Launch buttons -->
<div class="dropdown-buttons-trigger">
<button id="dropdown-buttons-trigger" class="btn btn-secondary topbarbtn"
aria-label="Launch interactive content"><i class="fas fa-rocket"></i></button>
<div class="dropdown-buttons">
<a class="binder-button" href="https://mybinder.org/v2/gh/gdsbook/book/master?urlpath=lab/tree/notebooks/11_regression.ipynb"><button type="button"
class="btn btn-secondary topbarbtn" title="Launch Binder" data-toggle="tooltip"
data-placement="left"><img class="binder-button-logo"
src="../_static/images/logo_binder.svg"
alt="Interact on binder">Binder</button></a>
<a class="colab-button" href="https://colab.research.google.com/github/gdsbook/book/blob/master/notebooks/11_regression.ipynb"><button type="button" class="btn btn-secondary topbarbtn"
title="Launch Colab" data-toggle="tooltip" data-placement="left"><img class="colab-button-logo"
src="../_static/images/logo_colab.png"
alt="Interact on Colab">Colab</button></a>
</div>
</div>
</div>
<!-- Table of contents -->
<div class="d-none d-md-block col-md-2 bd-toc show">
<div class="tocsection onthispage pt-5 pb-3">
<i class="fas fa-list"></i>
Contents
</div>
<nav id="bd-toc-nav">
<ul class="nav section-nav flex-column">
<li class="toc-h2 nav-item toc-entry">
<a class="reference internal nav-link" href="#introduction">
Introduction
</a>
<ul class="nav section-nav flex-column">
<li class="toc-h3 nav-item toc-entry">
<a class="reference internal nav-link" href="#what-is-spatial-regression-and-why-should-i-care">
<em>
What
</em>
is spatial regression and
<em>
why
</em>
should I care?
</a>
</li>
<li class="toc-h3 nav-item toc-entry">
<a class="reference internal nav-link" href="#the-data-san-diego-airbnb">
The Data: San Diego AirBnB
</a>
</li>
</ul>
</li>
<li class="toc-h2 nav-item toc-entry">
<a class="reference internal nav-link" href="#non-spatial-regression-a-very-quick-refresh">
Non-spatial regression, a (very) quick refresh
</a>
<ul class="nav section-nav flex-column">
<li class="toc-h3 nav-item toc-entry">
<a class="reference internal nav-link" href="#hidden-structures">
Hidden Structures
</a>
</li>
</ul>
</li>
<li class="toc-h2 nav-item toc-entry">
<a class="reference internal nav-link" href="#bringing-space-into-the-regression-framework">
Bringing space into the regression framework
</a>
<ul class="nav section-nav flex-column">
<li class="toc-h3 nav-item toc-entry">
<a class="reference internal nav-link" href="#spatial-feature-engineering">
Spatial Feature Engineering
</a>
<ul class="nav section-nav flex-column">
<li class="toc-h4 nav-item toc-entry">
<a class="reference internal nav-link" href="#proximity-variables">
Proximity variables
</a>
</li>
</ul>
</li>
<li class="toc-h3 nav-item toc-entry">
<a class="reference internal nav-link" href="#spatial-heterogeneity">
Spatial Heterogeneity
</a>
<ul class="nav section-nav flex-column">
<li class="toc-h4 nav-item toc-entry">
<a class="reference internal nav-link" href="#spatial-regimes">
Spatial Regimes
</a>
</li>
</ul>
</li>
<li class="toc-h3 nav-item toc-entry">
<a class="reference internal nav-link" href="#spatial-dependence">
Spatial Dependence
</a>
<ul class="nav section-nav flex-column">
<li class="toc-h4 nav-item toc-entry">
<a class="reference internal nav-link" href="#exogenous-effects-the-slx-model">
Exogenous effects: The SLX Model
</a>
</li>
<li class="toc-h4 nav-item toc-entry">
<a class="reference internal nav-link" href="#spatial-error">
Spatial Error
</a>
</li>
<li class="toc-h4 nav-item toc-entry">
<a class="reference internal nav-link" href="#spatial-lag">
Spatial Lag
</a>
</li>
<li class="toc-h4 nav-item toc-entry">
<a class="reference internal nav-link" href="#other-ways-of-bringing-space-into-regression">
Other ways of bringing space into regression
</a>
</li>
</ul>
</li>
</ul>
</li>
<li class="toc-h2 nav-item toc-entry">
<a class="reference internal nav-link" href="#questions">
Questions
</a>
<ul class="nav section-nav flex-column">
<li class="toc-h3 nav-item toc-entry">
<a class="reference internal nav-link" href="#challenge-questions">
Challenge Questions
</a>
<ul class="nav section-nav flex-column">
<li class="toc-h4 nav-item toc-entry">
<a class="reference internal nav-link" href="#the-random-coast">
The random coast
</a>
</li>
<li class="toc-h4 nav-item toc-entry">
<a class="reference internal nav-link" href="#the-k-neighbor-correlogram">
The K-neighbor correlogram
</a>
</li>
</ul>
</li>
</ul>
</li>
</ul>
</nav>
</div>
</div>
</div>
<div id="main-content" class="row">
<div class="col-12 col-md-9 pl-md-3 pr-md-0">
<div>
<div class="section" id="spatial-regression">
<h1>Spatial Regression<a class="headerlink" href="#spatial-regression" title="Permalink to this headline">¶</a></h1>
<div class="section" id="introduction">
<h2>Introduction<a class="headerlink" href="#introduction" title="Permalink to this headline">¶</a></h2>
<div class="section" id="what-is-spatial-regression-and-why-should-i-care">
<h3><em>What</em> is spatial regression and <em>why</em> should I care?<a class="headerlink" href="#what-is-spatial-regression-and-why-should-i-care" title="Permalink to this headline">¶</a></h3>
<p>Regression (and prediction more generally) provides a perfect case for examining how spatial structure can help us understand and analyze our data.
Usually, spatial structure helps models in one of two ways.
The first (and clearest) way space can affect our data is when the process <em>generating</em> the data is itself explicitly spatial.
Here, think of something like the prices for single family homes.
It’s often the case that individuals pay a premium on their house price in order to live in a better school district for the same quality house.
Alternatively, homes closer to noise or chemical polluters like waste water treatment plants, recycling facilities, or wide highways, may actually be cheaper than we would otherwise anticipate.
Finally, in cases like asthma incidence, the locations individuals tend to travel to throughout the day, such as their places of work or recreation, may have more impact on their health than their residential addresses.
In this case, it may be necessary to use data <em>from other sites</em> to predict the asthma incidence at a given site.
Regardless of the specific case at play, here, <em>geography is a feature</em>: it directly helps us make predictions about outcomes <em>because those outcomes obtain from geographical processes</em>.</p>
<p>An alternative (and more skeptical) understanding reluctantly acknowledges geography’s instrumental value.
Often, in the analysis of predictive methods and classifiers, we are interested in analyzing what we get wrong.
This is common in econometrics; an analyst may be concerned that the model <em>systematically</em> mis-predicts some types of observations.
If we know our model routinely performs poorly on a known set of observations or type of input, we might make a better model if we can account for this.
Among other kinds of error diagnostics, geography provides us with an exceptionally useful embedding to assess structure in our errors.
Mapping classification/prediction error can help show whether or not there are <em>clusters of error</em> in our data.
If we <em>know</em> that errors tend to be larger in some areas than in other areas (or if error is “contagious” between observations), then we might be able to exploit this structure to make better predictions.</p>
<p>Spatial structure in our errors might arise when geography <em>should be</em> an attribute somehow, but we are not sure exactly how to include it in our model.
It might also arise because there is some <em>other</em> feature whose omission causes the spatial patterns in the error we see; if this additional feature were included, the structure would disappear.
Or, it might arise from the complex interactions and interdependences between the features that we have chosen to use as predictors, resulting in intrinsic structure in mis-prediction.
Most of the predictors we use in models of social processes contain <em>embodied</em> spatial information: patterning intrinsic to the feature that we get for free in the model.
Whether we intend to or not, using a spatially-patterned predictor in a model can result in spatially-patterned errors; using more than one can amplify this effect.
Thus, <em>regardless of whether or not the true process is explicitly geographic</em>, additional information about the spatial relationships between our observations or more information about nearby sites can make our predictions better.</p>
</div>
<div class="section" id="the-data-san-diego-airbnb">
<h3>The Data: San Diego AirBnB<a class="headerlink" href="#the-data-san-diego-airbnb" title="Permalink to this headline">¶</a></h3>
<p>To learn a little more about how regression works, we’ll examine some information about AirBnB in San Diego, CA.
This dataset contains intrinsic characteristics of each house, both continuous (number of beds, as in <code class="docutils literal notranslate"><span class="pre">beds</span></code>) and categorical (type of property or, in AirBnB jargon, property group, as in the series of <code class="docutils literal notranslate"><span class="pre">pg_X</span></code> binary variables), as well as variables that explicitly refer to the location and spatial configuration of the dataset (e.g. distance to Balboa Park, <code class="docutils literal notranslate"><span class="pre">d2balboa</span></code>, or the neighborhood name, <code class="docutils literal notranslate"><span class="pre">neighborhood</span></code>).</p>
<div class="cell docutils container">
<div class="cell_input docutils container">
<div class="highlight-ipython3 notranslate"><div class="highlight"><pre><span></span><span class="o">%</span><span class="k">matplotlib</span> inline
<span class="kn">from</span> <span class="nn">pysal.model</span> <span class="kn">import</span> <span class="n">spreg</span>
<span class="kn">from</span> <span class="nn">pysal.lib</span> <span class="kn">import</span> <span class="n">weights</span>
<span class="kn">from</span> <span class="nn">pysal.explore</span> <span class="kn">import</span> <span class="n">esda</span>
<span class="kn">from</span> <span class="nn">scipy</span> <span class="kn">import</span> <span class="n">stats</span>
<span class="kn">import</span> <span class="nn">statsmodels.formula.api</span> <span class="k">as</span> <span class="nn">sm</span>
<span class="kn">import</span> <span class="nn">numpy</span>
<span class="kn">import</span> <span class="nn">pandas</span>
<span class="kn">import</span> <span class="nn">geopandas</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="nn">plt</span>
<span class="kn">import</span> <span class="nn">seaborn</span>
</pre></div>
</div>
</div>
</div>
<div class="cell docutils container">
<div class="cell_input docutils container">
<div class="highlight-ipython3 notranslate"><div class="highlight"><pre><span></span><span class="n">db</span> <span class="o">=</span> <span class="n">geopandas</span><span class="o">.</span><span class="n">read_file</span><span class="p">(</span><span class="s1">'../data/airbnb/regression_db.geojson'</span><span class="p">)</span>
<span class="n">db</span><span class="o">.</span><span class="n">info</span><span class="p">()</span>
</pre></div>
</div>
</div>
<div class="cell_output docutils container">
<div class="output stream highlight-myst-ansi notranslate"><div class="highlight"><pre><span></span>&lt;class 'geopandas.geodataframe.GeoDataFrame'&gt;
RangeIndex: 6110 entries, 0 to 6109
Data columns (total 20 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   accommodates        6110 non-null   int64
 1   bathrooms           6110 non-null   float64
 2   bedrooms            6110 non-null   float64
 3   beds                6110 non-null   float64
 4   neighborhood        6110 non-null   object
 5   pool                6110 non-null   int64
 6   d2balboa            6110 non-null   float64
 7   coastal             6110 non-null   int64
 8   price               6110 non-null   float64
 9   log_price           6110 non-null   float64
 10  id                  6110 non-null   int64
 11  pg_Apartment        6110 non-null   int64
 12  pg_Condominium      6110 non-null   int64
 13  pg_House            6110 non-null   int64
 14  pg_Other            6110 non-null   int64
 15  pg_Townhouse        6110 non-null   int64
 16  rt_Entire_home/apt  6110 non-null   int64
 17  rt_Private_room     6110 non-null   int64
 18  rt_Shared_room      6110 non-null   int64
 19  geometry            6110 non-null   geometry
dtypes: float64(6), geometry(1), int64(12), object(1)
memory usage: 954.8+ KB
</pre></div>
</div>
</div>
</div>
<p>These are the explanatory variables we will use throughout the chapter.</p>
<div class="cell docutils container">
<div class="cell_input docutils container">
<div class="highlight-ipython3 notranslate"><div class="highlight"><pre><span></span><span class="n">variable_names</span> <span class="o">=</span> <span class="p">[</span><span class="s1">'accommodates'</span><span class="p">,</span> <span class="s1">'bathrooms'</span><span class="p">,</span> <span class="s1">'bedrooms'</span><span class="p">,</span>
                  <span class="s1">'beds'</span><span class="p">,</span> <span class="s1">'rt_Private_room'</span><span class="p">,</span> <span class="s1">'rt_Shared_room'</span><span class="p">,</span>
                  <span class="s1">'pg_Condominium'</span><span class="p">,</span> <span class="s1">'pg_House'</span><span class="p">,</span>
                  <span class="s1">'pg_Other'</span><span class="p">,</span> <span class="s1">'pg_Townhouse'</span><span class="p">]</span>
</pre></div>
</div>
</div>
</div>
</div>
</div>
<div class="section" id="non-spatial-regression-a-very-quick-refresh">
<h2>Non-spatial regression, a (very) quick refresh<a class="headerlink" href="#non-spatial-regression-a-very-quick-refresh" title="Permalink to this headline">¶</a></h2>
<p>Before we discuss how to explicitly include space in the linear regression framework, let us show how basic regression can be carried out in Python, and how one can begin to interpret the results. By no means is this a formal or complete introduction to regression; if that is what you are looking for, we recommend <a href="#id1"><span class="problematic" id="id2">:cite:`Gelman_2006`</span></a>, in particular chapters 3 and 4, which provide a fantastic, non-spatial introduction.</p>
<p>The core idea of linear regression is to explain the variation in a given (<em>dependent</em>) variable as a linear function of a collection of other (<em>explanatory</em>) variables. For example, in our case, we may want to explain the price of an AirBnb property as a function of the number of bedrooms it has and whether it is a condominium. At the individual level, we can express this as:</p>
<div class="math notranslate nohighlight">
\[
P_i = \alpha + \sum_k \mathbf{X}_{ik}\beta_k + \epsilon_i
\]</div>
<p>where <span class="math notranslate nohighlight">\(P_i\)</span> is the AirBnb price of house <span class="math notranslate nohighlight">\(i\)</span>, and <span class="math notranslate nohighlight">\(\mathbf{X}\)</span> is a set of covariates that we use to explain that price. <span class="math notranslate nohighlight">\(\beta\)</span> is a vector of parameters that tell us in which direction and to what extent each variable is related to the price, and <span class="math notranslate nohighlight">\(\alpha\)</span>, the constant term, is the average house price when all the other variables are zero. The term <span class="math notranslate nohighlight">\(\epsilon_i\)</span> is usually referred to as the “error” and captures elements that influence the price of a house but are not included in <span class="math notranslate nohighlight">\(\mathbf{X}\)</span>. We can also express this relation in matrix form, dropping the subindices for <span class="math notranslate nohighlight">\(i\)</span>, which yields:</p>
<div class="math notranslate nohighlight">
\[
P = \alpha + \mathbf{X}\beta + \epsilon
\]</div>
<p>A regression can be seen as a multivariate extension of bivariate correlations. Indeed, one way to interpret the <span class="math notranslate nohighlight">\(\beta_k\)</span> coefficients in the equation above is as the degree of correlation between the explanatory variable <span class="math notranslate nohighlight">\(k\)</span> and the dependent variable, <em>keeping all the other explanatory variables constant</em>. When one calculates bivariate correlations, the coefficient of a variable is picking up the correlation between the variables, but it is also subsuming into it variation associated with other correlated variables – also called confounding factors. Regression allows us to isolate the distinct effect that a single variable has on the dependent one, once we <em>control</em> for those other variables.</p>
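<p>To make the idea of <em>controlling</em> for a confounder concrete, here is a minimal sketch on synthetic data (not the Airbnb dataset). The variable names, the coefficients used to simulate the data, and the use of <code class="docutils literal notranslate"><span class="pre">numpy.linalg.lstsq</span></code> are all assumptions made for illustration only:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Synthetic data: z is a confounder that drives both x and y.
z = rng.normal(size=n)
x = 0.8 * z + rng.normal(size=n)
y = 1.0 + 0.5 * x + 2.0 * z + rng.normal(scale=0.5, size=n)

# Bivariate fit: y on a constant and x only, so z is omitted.
X_biv = np.column_stack([np.ones(n), x])
beta_biv, *_ = np.linalg.lstsq(X_biv, y, rcond=None)

# Multivariate fit: y on a constant, x and z, so z is controlled for.
X_full = np.column_stack([np.ones(n), x, z])
beta_full, *_ = np.linalg.lstsq(X_full, y, rcond=None)

print("slope on x, z omitted:   ", round(beta_biv[1], 3))
print("slope on x, z controlled:", round(beta_full[1], 3))
```

<p>In this simulated setting, the slope on <code class="docutils literal notranslate"><span class="pre">x</span></code> absorbs part of the effect of <code class="docutils literal notranslate"><span class="pre">z</span></code> when <code class="docutils literal notranslate"><span class="pre">z</span></code> is left out, and moves back towards the value used to generate the data (0.5) once <code class="docutils literal notranslate"><span class="pre">z</span></code> is included.</p>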
<p>Practically speaking, linear regressions in Python are rather streamlined and easy to work with. There are also several packages which will run them (e.g. <code class="docutils literal notranslate"><span class="pre">statsmodels</span></code>, <code class="docutils literal notranslate"><span class="pre">scikit-learn</span></code>, <code class="docutils literal notranslate"><span class="pre">PySAL</span></code>). In the context of this chapter, it makes sense to start with <code class="docutils literal notranslate"><span class="pre">PySAL</span></code> as that is the only library that will allow us to move into explicitly spatial econometric models. To fit the model specified in the equation above with <span class="math notranslate nohighlight">\(X\)</span> as the list defined, we only need the following line of code:</p>
<div class="cell docutils container">
<div class="cell_input docutils container">
<div class="highlight-ipython3 notranslate"><div class="highlight"><pre><span></span><span class="n">m1</span> <span class="o">=</span> <span class="n">spreg</span><span class="o">.</span><span class="n">OLS</span><span class="p">(</span><span class="n">db</span><span class="p">[[</span><span class="s1">'log_price'</span><span class="p">]]</span><span class="o">.</span><span class="n">values</span><span class="p">,</span> <span class="n">db</span><span class="p">[</span><span class="n">variable_names</span><span class="p">]</span><span class="o">.</span><span class="n">values</span><span class="p">,</span>
<span class="n">name_y</span><span class="o">=</span><span class="s1">'log_price'</span><span class="p">,</span> <span class="n">name_x</span><span class="o">=</span><span class="n">variable_names</span><span class="p">)</span>
</pre></div>
</div>
</div>
</div>
<p>We use the command <code class="docutils literal notranslate"><span class="pre">OLS</span></code>, part of the <code class="docutils literal notranslate"><span class="pre">spreg</span></code> sub-package, and specify the dependent variable (the log of the price, so we can interpret results in terms of percentage change) and the explanatory ones. Note that both objects need to be arrays, so we extract them from the <code class="docutils literal notranslate"><span class="pre">pandas.DataFrame</span></code> object using <code class="docutils literal notranslate"><span class="pre">.values</span></code>.</p>
<p>In order to inspect the results of the model, we can call <code class="docutils literal notranslate"><span class="pre">summary</span></code>:</p>
<div class="cell docutils container">
<div class="cell_input docutils container">
<div class="highlight-ipython3 notranslate"><div class="highlight"><pre><span></span><span class="nb">print</span><span class="p">(</span><span class="n">m1</span><span class="o">.</span><span class="n">summary</span><span class="p">)</span>
</pre></div>
</div>
</div>
<div class="cell_output docutils container">
<div class="output stream highlight-myst-ansi notranslate"><div class="highlight"><pre><span></span>REGRESSION
----------
SUMMARY OF OUTPUT: ORDINARY LEAST SQUARES
-----------------------------------------
Data set : unknown
Weights matrix : None
Dependent Variable : log_price Number of Observations: 6110
Mean dependent var : 4.9958 Number of Variables : 11
S.D. dependent var : 0.8072 Degrees of Freedom : 6099
R-squared : 0.6683
Adjusted R-squared : 0.6678
Sum squared residual: 1320.148 F-statistic : 1229.0564
Sigma-square : 0.216 Prob(F-statistic) : 0
S.E. of regression : 0.465 Log likelihood : -3988.895
Sigma-square ML : 0.216 Akaike info criterion : 7999.790
S.E of regression ML: 0.4648 Schwarz criterion : 8073.685
------------------------------------------------------------------------------------
Variable Coefficient Std.Error t-Statistic Probability
------------------------------------------------------------------------------------
CONSTANT 4.3883830 0.0161147 272.3217773 0.0000000
accommodates 0.0834523 0.0050781 16.4336318 0.0000000
bathrooms 0.1923790 0.0109668 17.5419773 0.0000000
bedrooms 0.1525221 0.0111323 13.7009195 0.0000000
beds -0.0417231 0.0069383 -6.0134430 0.0000000
rt_Private_room -0.5506868 0.0159046 -34.6244758 0.0000000
rt_Shared_room -1.2383055 0.0384329 -32.2198992 0.0000000
pg_Condominium 0.1436347 0.0221499 6.4846529 0.0000000
pg_House -0.0104894 0.0145315 -0.7218393 0.4704209
pg_Other 0.1411546 0.0228016 6.1905633 0.0000000
pg_Townhouse -0.0416702 0.0342758 -1.2157316 0.2241342
------------------------------------------------------------------------------------
REGRESSION DIAGNOSTICS
MULTICOLLINEARITY CONDITION NUMBER 11.964
TEST ON NORMALITY OF ERRORS
TEST DF VALUE PROB
Jarque-Bera 2 2671.611 0.0000
DIAGNOSTICS FOR HETEROSKEDASTICITY
RANDOM COEFFICIENTS
TEST DF VALUE PROB
Breusch-Pagan test 10 322.532 0.0000
Koenker-Bassett test 10 135.581 0.0000
================================ END OF REPORT =====================================
</pre></div>
</div>
</div>
</div>
<p>A full detailed explanation of the output is beyond the scope of this chapter, so we will focus on the relevant bits for our main purpose. This is concentrated on the <code class="docutils literal notranslate"><span class="pre">Coefficients</span></code> section, which gives us the estimates for <span class="math notranslate nohighlight">\(\beta_k\)</span> in our model. In other words, these numbers express the relationship between each explanatory variable and the dependent one, once the effect of confounding factors has been accounted for. Keep in mind however that regression is no magic; we are only discounting the effect of confounding factors that we include in the model, not of <em>all</em> potentially confounding factors.</p>
<p>Results are largely as expected: houses tend to be significantly more expensive if they accommodate more people (<code class="docutils literal notranslate"><span class="pre">accommodates</span></code>), if they have more bathrooms and bedrooms, and if they are a condominium or part of the “other” category of house type. Conversely, given a number of rooms, houses with more beds (i.e., listings that are more “crowded”) tend to go for cheaper, as is the case for properties where one does not rent the entire house but only a room (<code class="docutils literal notranslate"><span class="pre">rt_Private_room</span></code>) or even shares it (<code class="docutils literal notranslate"><span class="pre">rt_Shared_room</span></code>). Of course, you might conceptually doubt the assumption that it is possible to <em>arbitrarily</em> change the number of beds within an Airbnb without eventually changing the number of people it accommodates, but methods to address these concerns using <em>interaction effects</em> won’t be discussed here.</p>
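<p>Because the dependent variable is the log of the price, a coefficient <span class="math notranslate nohighlight">\(\beta_k\)</span> can be read as an approximate percentage change, and the exact figure is <span class="math notranslate nohighlight">\(100 \times (e^{\beta_k} - 1)\)</span>. The short sketch below applies that conversion to three coefficients copied by hand from the summary table above; the helper name <code class="docutils literal notranslate"><span class="pre">pct_change</span></code> is ours and not part of any library:</p>

```python
import math

# Coefficients copied from the OLS summary printed above.
coefs = {
    "bathrooms": 0.1923790,
    "rt_Private_room": -0.5506868,
    "rt_Shared_room": -1.2383055,
}


def pct_change(beta):
    """Exact percentage change in price implied by a log-price coefficient."""
    return 100.0 * (math.exp(beta) - 1.0)


pct = {name: pct_change(beta) for name, beta in coefs.items()}
for name, value in pct.items():
    print(f"{name}: {value:+.1f}%")
```

<p>For small coefficients the shortcut of reading <span class="math notranslate nohighlight">\(100 \times \beta_k\)</span> directly is close enough, but for large ones such as the room-type dummies the two can differ noticeably.</p>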
<div class="section" id="hidden-structures">
<h3>Hidden Structures<a class="headerlink" href="#hidden-structures" title="Permalink to this headline">¶</a></h3>
<p>In general, our model performs well, explaining about two thirds (<span class="math notranslate nohighlight">\(R^2=0.67\)</span>) of the variation in the nightly log price using the covariates we’ve discussed above.
However, the model’s errors might cluster in space.
There are a few ways to investigate this.
One simple approach is to look at the correlation between the error in predicting an Airbnb and the error in predicting its nearest neighbor.
Before doing so, though, we might want to split our data up by regions and see if we’ve got some spatial structure in our residuals.
One reasonable theory might be that our model does not include any information about <em>beaches</em>, a critical aspect of why people live and vacation in San Diego.
Therefore, we might want to see whether our errors are higher or lower depending on whether an Airbnb is in a “beach” neighborhood, that is, a neighborhood near the ocean:</p>
<div class="cell docutils container">
<div class="cell_input docutils container">
<div class="highlight-ipython3 notranslate"><div class="highlight"><pre><span></span><span class="n">is_coastal</span> <span class="o">=</span> <span class="n">db</span><span class="o">.</span><span class="n">coastal</span><span class="o">.</span><span class="n">astype</span><span class="p">(</span><span class="nb">bool</span><span class="p">)</span>
<span class="n">coastal</span> <span class="o">=</span> <span class="n">m1</span><span class="o">.</span><span class="n">u</span><span class="p">[</span><span class="n">is_coastal</span><span class="p">]</span>
<span class="n">not_coastal</span> <span class="o">=</span> <span class="n">m1</span><span class="o">.</span><span class="n">u</span><span class="p">[</span><span class="o">~</span><span class="n">is_coastal</span><span class="p">]</span>
<span class="n">plt</span><span class="o">.</span><span class="n">hist</span><span class="p">(</span><span class="n">coastal</span><span class="p">,</span> <span class="n">density</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s1">'Coastal'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">hist</span><span class="p">(</span><span class="n">not_coastal</span><span class="p">,</span> <span class="n">histtype</span><span class="o">=</span><span class="s1">'step'</span><span class="p">,</span>
<span class="n">density</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">linewidth</span><span class="o">=</span><span class="mi">4</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s1">'Not Coastal'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">vlines</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span> <span class="n">linestyle</span><span class="o">=</span><span class="s2">":"</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s1">'k'</span><span class="p">,</span> <span class="n">linewidth</span><span class="o">=</span><span class="mi">4</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">legend</span><span class="p">()</span>
<span class="n">plt</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
</pre></div>
</div>
</div>
<div class="cell_output docutils container">
<img alt="../_images/11_regression_11_0.png" src="../_images/11_regression_11_0.png" />
</div>
</div>
<p>While it appears that the neighborhoods on the coast have only slightly higher average errors (and have lower variance in their prediction errors), the means of the two groups of residuals differ significantly when compared using a classic <span class="math notranslate nohighlight">\(t\)</span> test:</p>
<div class="cell docutils container">
<div class="cell_input docutils container">
<div class="highlight-ipython3 notranslate"><div class="highlight"><pre><span></span><span class="n">stats</span><span class="o">.</span><span class="n">ttest_ind</span><span class="p">(</span><span class="n">coastal</span><span class="p">,</span>
<span class="n">not_coastal</span><span class="p">,</span>
<span class="c1"># permutations=9999 not yet available in scipy</span>
<span class="p">)</span>
</pre></div>
</div>
</div>
<div class="cell_output docutils container">
<div class="output text_plain highlight-myst-ansi notranslate"><div class="highlight"><pre><span></span>Ttest_indResult(statistic=array([13.98193858]), pvalue=array([9.442438e-44]))
</pre></div>
</div>
</div>
</div>
<p>There are more sophisticated (and harder to fool) tests that may be applicable for this data, however. We cover them in the <a class="reference external" href="#Challenge">Challenge</a> section.</p>
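<p>As one illustration of a test that makes fewer distributional assumptions, the sketch below runs a simple permutation test for a difference in means. It uses synthetic residuals rather than <code class="docutils literal notranslate"><span class="pre">m1.u</span></code>; the group sizes, means and spreads are made up, and this is not necessarily the approach taken in the Challenge section:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for the coastal / non-coastal residuals (assumed values).
coastal = rng.normal(loc=0.15, scale=0.4, size=300)
not_coastal = rng.normal(loc=-0.05, scale=0.5, size=700)

observed = coastal.mean() - not_coastal.mean()

# Shuffle the group labels many times and recompute the difference in means.
pooled = np.concatenate([coastal, not_coastal])
n_coastal = len(coastal)
n_perm = 999
perm_diffs = np.empty(n_perm)
for i in range(n_perm):
    shuffled = rng.permutation(pooled)
    perm_diffs[i] = shuffled[:n_coastal].mean() - shuffled[n_coastal:].mean()

# Two-sided pseudo p-value: share of shuffles at least as extreme as observed.
p_value = (1 + np.sum(np.abs(perm_diffs) >= abs(observed))) / (n_perm + 1)
print(f"observed difference: {observed:.3f}, pseudo p-value: {p_value:.3f}")
```

<p>The reference distribution here comes from the data themselves, so the test does not rely on the residuals being normally distributed or on the two groups having equal variance.</p>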
<p>Additionally, it might be the case that some neighborhoods are more desirable than others due to unmodeled latent preferences or marketing.
For instance, despite being close to the sea, living near Camp Pendleton, a Marine base in the north of the city, may incur some significant penalties on area desirability due to noise and pollution.
To determine whether this is the case, we might be interested in the full distribution of model residuals within each neighborhood.</p>
<p>To make this clearer, we’ll first sort the data by the median residual in each neighborhood, and then make a box plot, which shows the distribution of residuals in each neighborhood:</p>
<div class="cell docutils container">
<div class="cell_input docutils container">
<div class="highlight-ipython3 notranslate"><div class="highlight"><pre><span></span><span class="n">db</span><span class="p">[</span><span class="s1">'residual'</span><span class="p">]</span> <span class="o">=</span> <span class="n">m1</span><span class="o">.</span><span class="n">u</span>
<span class="n">medians</span> <span class="o">=</span> <span class="n">db</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="s2">"neighborhood"</span><span class="p">)</span><span class="o">.</span><span class="n">residual</span><span class="o">.</span><span class="n">median</span><span class="p">()</span><span class="o">.</span><span class="n">to_frame</span><span class="p">(</span><span class="s1">'hood_residual'</span><span class="p">)</span>
<span class="n">f</span> <span class="o">=</span> <span class="n">plt</span><span class="o">.</span><span class="n">figure</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">15</span><span class="p">,</span><span class="mi">3</span><span class="p">))</span>
<span class="n">ax</span> <span class="o">=</span> <span class="n">plt</span><span class="o">.</span><span class="n">gca</span><span class="p">()</span>
<span class="n">seaborn</span><span class="o">.</span><span class="n">boxplot</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="s1">'neighborhood'</span><span class="p">,</span> <span class="n">y</span><span class="o">=</span><span class="s1">'residual'</span><span class="p">,</span> <span class="n">ax</span> <span class="o">=</span> <span class="n">ax</span><span class="p">,</span>
<span class="n">data</span><span class="o">=</span><span class="n">db</span><span class="o">.</span><span class="n">merge</span><span class="p">(</span><span class="n">medians</span><span class="p">,</span> <span class="n">how</span><span class="o">=</span><span class="s1">'left'</span><span class="p">,</span>
<span class="n">left_on</span><span class="o">=</span><span class="s1">'neighborhood'</span><span class="p">,</span>
<span class="n">right_index</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="o">.</span><span class="n">sort_values</span><span class="p">(</span><span class="s1">'hood_residual'</span><span class="p">),</span> <span class="n">palette</span><span class="o">=</span><span class="s1">'bwr'</span><span class="p">)</span>
<span class="n">f</span><span class="o">.</span><span class="n">autofmt_xdate</span><span class="p">()</span>
<span class="n">plt</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
</pre></div>
</div>
</div>
<div class="cell_output docutils container">
<img alt="../_images/11_regression_16_1.png" src="../_images/11_regression_16_1.png" />
</div>
</div>
<p>No neighborhood’s residuals are fully separated from the rest, but some do appear to be higher than others, such as the well-known downtown tourist areas of the Gaslamp Quarter, Little Italy, or The Core.
Thus, there may be a distinctive effect of intangible neighborhood fashionableness that matters in this model.</p>
<p>Noting that many of the most over- and under-predicted neighborhoods are near one another in the city, it may also be the case that there is some sort of <em>contagion</em> or spatial spillovers in the nightly rent price.
This often is apparent when individuals seek to price their airbnb listings to compete with similar nearby listings.
Since our model is not aware of this behavior, its errors may tend to cluster.
One exceptionally simple way we can look into this structure is by examining the relationship between an observation’s residuals and its surrounding residuals.</p>
<p>To do this, we will use <em>spatial weights</em> to represent the geographic relationships between observations.
We cover spatial weights in detail in another chapter, so we will not repeat ourselves here.
For this example, we’ll start off with a <span class="math notranslate nohighlight">\(KNN\)</span> matrix where <span class="math notranslate nohighlight">\(k=1\)</span>, meaning we’re focusing only on the linkage of each Airbnb to its closest other listing.</p>
<div class="cell docutils container">
<div class="cell_input docutils container">
<div class="highlight-ipython3 notranslate"><div class="highlight"><pre><span></span><span class="n">knn</span> <span class="o">=</span> <span class="n">weights</span><span class="o">.</span><span class="n">KNN</span><span class="o">.</span><span class="n">from_dataframe</span><span class="p">(</span><span class="n">db</span><span class="p">,</span> <span class="n">k</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
</pre></div>
</div>
</div>
<div class="cell_output docutils container">
<div class="output stderr highlight-myst-ansi notranslate"><div class="highlight"><pre><span></span>/opt/conda/lib/python3.8/site-packages/libpysal/weights/weights.py:172: UserWarning: The weights matrix is not fully connected:
There are 1849 disconnected components.
warnings.warn(message)
</pre></div>
</div>
</div>
</div>
<p>This means that, when we compute the <em>spatial lag</em> of that knn weight and the residual, we get the residual of the airbnb listing closest to each observation.</p>
<div class="cell docutils container">
<div class="cell_input docutils container">
<div class="highlight-ipython3 notranslate"><div class="highlight"><pre><span></span><span class="n">lag_residual</span> <span class="o">=</span> <span class="n">weights</span><span class="o">.</span><span class="n">spatial_lag</span><span class="o">.</span><span class="n">lag_spatial</span><span class="p">(</span><span class="n">knn</span><span class="p">,</span> <span class="n">m1</span><span class="o">.</span><span class="n">u</span><span class="p">)</span>
<span class="n">ax</span> <span class="o">=</span> <span class="n">seaborn</span><span class="o">.</span><span class="n">regplot</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="n">m1</span><span class="o">.</span><span class="n">u</span><span class="o">.</span><span class="n">flatten</span><span class="p">(),</span> <span class="n">y</span><span class="o">=</span><span class="n">lag_residual</span><span class="o">.</span><span class="n">flatten</span><span class="p">(),</span>
<span class="n">line_kws</span><span class="o">=</span><span class="nb">dict</span><span class="p">(</span><span class="n">color</span><span class="o">=</span><span class="s1">'orangered'</span><span class="p">),</span>
<span class="n">ci</span><span class="o">=</span><span class="kc">None</span><span class="p">)</span>
<span class="n">ax</span><span class="o">.</span><span class="n">set_xlabel</span><span class="p">(</span><span class="s1">'Model Residuals - $u$'</span><span class="p">)</span>
<span class="n">ax</span><span class="o">.</span><span class="n">set_ylabel</span><span class="p">(</span><span class="s1">'Spatial Lag of Model Residuals - $W u$'</span><span class="p">);</span>
</pre></div>
</div>
</div>
<div class="cell_output docutils container">
<img alt="../_images/11_regression_20_1.png" src="../_images/11_regression_20_1.png" />
</div>
</div>
<p>In this plot, we see that our prediction errors tend to cluster!
Above, we show the relationship between our prediction error at each site and the prediction error at the site nearest to it.
Here, we’re using this nearest site to stand in for the <em>surroundings</em> of that Airbnb.
This means that, when the model tends to overpredict a given Airbnb’s nightly log price, sites around that Airbnb are more likely to <em>also be overpredicted</em>.</p>
<p>An interesting property of this relationship is that it tends to stabilize as the number of nearest neighbors used to construct each Airbnb’s surroundings increases.
Consult the <a class="reference external" href="#Challenge">Challenge</a> section for more on this property.</p>
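<p>To give a feel for what “surroundings” means here, the sketch below computes the average residual over each point’s <span class="math notranslate nohighlight">\(k\)</span> nearest neighbors for several values of <span class="math notranslate nohighlight">\(k\)</span>. It uses synthetic coordinates and synthetic residuals, and <code class="docutils literal notranslate"><span class="pre">scipy.spatial.cKDTree</span></code> instead of the <code class="docutils literal notranslate"><span class="pre">weights</span></code> objects used in this chapter; the helper <code class="docutils literal notranslate"><span class="pre">knn_lag</span></code> is a name we made up for illustration:</p>

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(42)
n = 500

# Synthetic locations and "residuals": a smooth surface plus independent noise.
coords = rng.uniform(0, 1, size=(n, 2))
u = (np.sin(4 * coords[:, 0]) + np.cos(4 * coords[:, 1])
     + rng.normal(scale=0.5, size=n))

tree = cKDTree(coords)


def knn_lag(values, k):
    """Average of `values` over each point's k nearest neighbors."""
    # Ask for k + 1 neighbors: each point is returned as its own nearest
    # neighbor (distance zero), so the first column is dropped.
    _, idx = tree.query(coords, k=k + 1)
    return values[idx[:, 1:]].mean(axis=1)


correlations = {k: np.corrcoef(u, knn_lag(u, k))[0, 1]
                for k in (1, 5, 10, 20, 40)}
for k, r in correlations.items():
    print(f"k={k:>2}: corr(u, lag) = {r:.3f}")
```

<p>With <span class="math notranslate nohighlight">\(k=1\)</span> the lag is just the residual of the single closest point, so it carries all of that point’s noise; averaging over more neighbors smooths that noise out.</p>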
<p>Given this behavior, let’s use <span class="math notranslate nohighlight">\(k=20\)</span> nearest neighbors, a value at which the relationship is stable.
Examining the relationship between this stable <em>surrounding</em> average and the focal Airbnb, we can even find clusters in our model error.
Recalling the <em>local Moran</em> statistics, we can identify certain areas where our predictions of the nightly (log) Airbnb price tend to be significantly off:</p>
<div class="cell docutils container">
<div class="cell_input docutils container">
<div class="highlight-ipython3 notranslate"><div class="highlight"><pre><span></span><span class="n">knn</span><span class="o">.</span><span class="n">reweight</span><span class="p">(</span><span class="n">k</span><span class="o">=</span><span class="mi">20</span><span class="p">,</span> <span class="n">inplace</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="n">outliers</span> <span class="o">=</span> <span class="n">esda</span><span class="o">.</span><span class="n">moran</span><span class="o">.</span><span class="n">Moran_Local</span><span class="p">(</span><span class="n">m1</span><span class="o">.</span><span class="n">u</span><span class="p">,</span> <span class="n">knn</span><span class="p">,</span> <span class="n">permutations</span><span class="o">=</span><span class="mi">9999</span><span class="p">)</span>
<span class="n">error_clusters</span> <span class="o">=</span> <span class="p">(</span><span class="n">outliers</span><span class="o">.</span><span class="n">q</span> <span class="o">%</span> <span class="mi">2</span> <span class="o">==</span> <span class="mi">1</span><span class="p">)</span> <span class="c1"># only the cluster cores</span>
<span class="n">error_clusters</span> <span class="o">&=</span> <span class="p">(</span><span class="n">outliers</span><span class="o">.</span><span class="n">p_sim</span> <span class="o"><=</span> <span class="o">.</span><span class="mi">001</span><span class="p">)</span> <span class="c1"># filtering out non-significant clusters</span>
<span class="n">db</span><span class="o">.</span><span class="n">assign</span><span class="p">(</span><span class="n">error_clusters</span> <span class="o">=</span> <span class="n">error_clusters</span><span class="p">,</span>
<span class="n">local_I</span> <span class="o">=</span> <span class="n">outliers</span><span class="o">.</span><span class="n">Is</span><span class="p">)</span>\
<span class="o">.</span><span class="n">query</span><span class="p">(</span><span class="s2">"error_clusters"</span><span class="p">)</span>\
<span class="o">.</span><span class="n">sort_values</span><span class="p">(</span><span class="s1">'local_I'</span><span class="p">)</span>\
<span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="s1">'local_I'</span><span class="p">,</span> <span class="n">cmap</span><span class="o">=</span><span class="s1">'bwr'</span><span class="p">,</span> <span class="n">marker</span><span class="o">=</span><span class="s1">'.'</span><span class="p">);</span>
</pre></div>
</div>
</div>
<div class="cell_output docutils container">
<div class="output stderr highlight-myst-ansi notranslate"><div class="highlight"><pre><span></span>/opt/conda/lib/python3.8/site-packages/libpysal/weights/weights.py:172: UserWarning: The weights matrix is not fully connected:
There are 3 disconnected components.
warnings.warn(message)
</pre></div>
</div>
<img alt="../_images/11_regression_23_1.png" src="../_images/11_regression_23_1.png" />
</div>
</div>
<p>Thus, these areas tend to be locations where our model significantly mispredicts the nightly Airbnb price in the same direction, either too low or too high, both for that specific observation and for observations in its immediate surroundings.
This is critical since, if we can identify how these areas are structured (that is, if they have a <em>consistent geography</em> that we can model), then we might make our predictions even better, or at least not systematically mis-predict prices in some areas while correctly predicting prices in other areas.</p>
<p>Since significant under- and over-predictions do appear to cluster in a highly structured way, we might be able to use a better model to fix the geography of our model errors.</p>
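<p>The filter <code class="docutils literal notranslate"><span class="pre">outliers.q</span> <span class="pre">%</span> <span class="pre">2</span> <span class="pre">==</span> <span class="pre">1</span></code> used above relies on the quadrant coding of the local Moran statistic, where 1 is high-high, 2 is low-high, 3 is low-low and 4 is high-low. The sketch below reproduces that coding by hand on a toy example of six sites arranged along a line; the function name, the toy values and the dense row-standardized matrix are illustrative assumptions, not the <code class="docutils literal notranslate"><span class="pre">esda</span></code> implementation:</p>

```python
import numpy as np


def moran_quadrants(u, W):
    """Quadrant code per site: 1 = HH, 2 = LH, 3 = LL, 4 = HL."""
    z = u - u.mean()      # centered values
    z_lag = W @ z         # average of the neighbors' centered values
    high, lag_high = z > 0, z_lag > 0
    q = np.empty(len(z), dtype=int)
    q[high & lag_high] = 1
    q[~high & lag_high] = 2
    q[~high & ~lag_high] = 3
    q[high & ~lag_high] = 4
    return q


# Toy residuals for six sites on a line (made-up numbers).
u = np.array([2.0, 1.5, -1.0, 1.0, -1.5, -2.0])

# Each site's neighbors are the adjacent sites; rows are standardized to sum to 1.
n = len(u)
W = np.zeros((n, n))
for i in range(n):
    for j in (i - 1, i + 1):
        if 0 <= j < n:
            W[i, j] = 1.0
W /= W.sum(axis=1, keepdims=True)

q = moran_quadrants(u, W)
cores = q % 2 == 1  # odd codes: HH and LL, where site and surroundings agree
print(q, cores)
```

<p>Odd codes pick out sites whose residual has the same sign as that of their surroundings, which is why they are treated as the cores of clusters; whether a given core is statistically significant is a separate question, answered by the permutation-based <code class="docutils literal notranslate"><span class="pre">p_sim</span></code>.</p>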
</div>
</div>
<div class="section" id="bringing-space-into-the-regression-framework">
<h2>Bringing space into the regression framework<a class="headerlink" href="#bringing-space-into-the-regression-framework" title="Permalink to this headline">¶</a></h2>
<p>There are many different ways that spatial structure shows up in our models, predictions, and our data, even if we do not explicitly intend to study it.
Fortunately, there are nearly as many techniques, called <em>spatial regression</em> methods, that are designed to handle these sorts of structures.
Spatial regression is about <em>explicitly</em> introducing space or geographical context into the statistical framework of a regression.
Conceptually, we want to introduce space into our model whenever we think it plays an important role in the process we are interested in, or when space can act as a reasonable proxy for other factors we cannot but should include in our model.
As an example of the former, we can imagine how houses at the seafront are probably more expensive than those in the second row, given their better views.
To illustrate the latter, we can think of how the character of a neighborhood is important in determining the price of a house; however, it is very hard to identify and quantify “character” <em>per se,</em> although it might be easier to get at its spatial variation, hence a case of space as a proxy.</p>
<p>Spatial regression is a large field of development in the econometrics and statistics literatures.
In this brief introduction, we will consider two related but very different processes that give rise to spatial effects: spatial heterogeneity and spatial dependence.
For more rigorous treatments of the topics introduced here, the reader is
referred to <a href="#id3"><span class="problematic" id="id4">:cite:`Anselin_2003,Anselin_2014,Gelman_2006`</span></a>.</p>
<div class="section" id="spatial-feature-engineering">
<h3>Spatial Feature Engineering<a class="headerlink" href="#spatial-feature-engineering" title="Permalink to this headline">¶</a></h3>
<p>Using geographic information to “construct” new data is a common approach to bringing spatial information into geographic analysis.
Often, this reflects the fact that processes are not the same everywhere in the map of analysis, or that geographical information may be useful to predict our outcome of interest. In this section, we will briefly present how to use <em>spatial features</em>, or <span class="math notranslate nohighlight">\(X\)</span> variables that are constructed from geographical relationships, in a standard linear model. The depth and extent of spatial feature engineering is difficult to overstate, and we discuss it extensively in the next chapter. Here, we will consider only the simplest of spatial features: proximity variables.</p>
<div class="section" id="proximity-variables">
<h4>Proximity variables<a class="headerlink" href="#proximity-variables" title="Permalink to this headline">¶</a></h4>
<p>For a start, one relevant proximity-driven variable that could influence our model is based on each listing’s proximity to Balboa Park. A common tourist destination, Balboa Park is a central recreation hub for the city of San Diego, containing many museums and the San Diego Zoo. Thus, it could be the case that people searching for Airbnbs in San Diego are willing to pay a premium to stay closer to the park. If this were true <em>and</em> we omitted this from our model, we may indeed see a significant spatial pattern caused by this distance decay effect.</p>
<p>This is sometimes called a <em>spatially-patterned omitted covariate</em>: geographic information that our model needs to make good predictions but which we have left out. Therefore, let’s build a new model containing this distance to Balboa Park covariate. First, though, it helps to visualize the structure of this distance covariate itself:</p>
<div class="cell docutils container">
<div class="cell_input docutils container">
<div class="highlight-ipython3 notranslate"><div class="highlight"><pre><span></span><span class="n">db</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="s1">'d2balboa'</span><span class="p">,</span> <span class="n">marker</span><span class="o">=</span><span class="s1">'.'</span><span class="p">,</span> <span class="n">s</span><span class="o">=</span><span class="mi">5</span><span class="p">)</span>
</pre></div>
</div>
</div>
<div class="cell_output docutils container">
<div class="output text_plain highlight-myst-ansi notranslate"><div class="highlight"><pre><span></span><AxesSubplot:>
</pre></div>
</div>
<img alt="../_images/11_regression_26_1.png" src="../_images/11_regression_26_1.png" />
</div>
</div>
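<p>The <code class="docutils literal notranslate"><span class="pre">d2balboa</span></code> column is already present in our table. For readers who want to build a similar proximity variable themselves, one possible approach is sketched below using the haversine (great-circle) formula. The coordinates given for Balboa Park are approximate, the sample listing locations are made up, and the units or method behind the actual <code class="docutils literal notranslate"><span class="pre">d2balboa</span></code> column may differ:</p>

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, a common approximation


def haversine_km(lat, lon, lat0, lon0):
    """Great-circle distance in km from (lat, lon) to (lat0, lon0), in degrees."""
    lat, lon, lat0, lon0 = map(np.radians, (lat, lon, lat0, lon0))
    dlat = lat - lat0
    dlon = lon - lon0
    a = np.sin(dlat / 2) ** 2 + np.cos(lat) * np.cos(lat0) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


# Approximate location of Balboa Park (an assumption for this sketch).
balboa = (32.7341, -117.1446)

# Made-up listing coordinates: the park itself, a point one degree of
# latitude due north, and a point somewhere to the south-west.
lats = np.array([32.7341, 33.7341, 32.7157])
lons = np.array([-117.1446, -117.1446, -117.1611])

d = haversine_km(lats, lons, *balboa)
print(d.round(2))
```

<p>For data in a projected coordinate system, a plain Euclidean distance on the projected coordinates would usually be simpler than the haversine formula.</p>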
<div class="cell docutils container">
<div class="cell_input docutils container">
<div class="highlight-ipython3 notranslate"><div class="highlight"><pre><span></span><span class="n">base_names</span> <span class="o">=</span> <span class="n">variable_names</span>
<span class="n">balboa_names</span> <span class="o">=</span> <span class="n">variable_names</span> <span class="o">+</span> <span class="p">[</span><span class="s1">'d2balboa'</span><span class="p">]</span>
</pre></div>
</div>
</div>
</div>
<div class="cell docutils container">
<div class="cell_input docutils container">
<div class="highlight-ipython3 notranslate"><div class="highlight"><pre><span></span><span class="n">m2</span> <span class="o">=</span> <span class="n">spreg</span><span class="o">.</span><span class="n">OLS</span><span class="p">(</span><span class="n">db</span><span class="p">[[</span><span class="s1">'log_price'</span><span class="p">]]</span><span class="o">.</span><span class="n">values</span><span class="p">,</span> <span class="n">db</span><span class="p">[</span><span class="n">balboa_names</span><span class="p">]</span><span class="o">.</span><span class="n">values</span><span class="p">,</span>
<span class="n">name_y</span> <span class="o">=</span> <span class="s1">'log_price'</span><span class="p">,</span> <span class="n">name_x</span> <span class="o">=</span> <span class="n">balboa_names</span><span class="p">)</span>
</pre></div>
</div>
</div>
</div>
<p>Unfortunately, when you inspect the regression output and diagnostics, you see that this covariate is not quite as helpful as we might have anticipated. It is not statistically significant at the conventional 5% level (its p-value is about 0.089), and the model fit does not substantially change:</p>
<div class="cell docutils container">
<div class="cell_input docutils container">
<div class="highlight-ipython3 notranslate"><div class="highlight"><pre><span></span><span class="nb">print</span><span class="p">(</span><span class="n">m2</span><span class="o">.</span><span class="n">summary</span><span class="p">)</span>
</pre></div>
</div>
</div>
<div class="cell_output docutils container">
<div class="output stream highlight-myst-ansi notranslate"><div class="highlight"><pre><span></span>REGRESSION
----------
SUMMARY OF OUTPUT: ORDINARY LEAST SQUARES
-----------------------------------------
Data set : unknown
Weights matrix : None
Dependent Variable : log_price Number of Observations: 6110
Mean dependent var : 4.9958 Number of Variables : 12
S.D. dependent var : 0.8072 Degrees of Freedom : 6098
R-squared : 0.6685
Adjusted R-squared : 0.6679
Sum squared residual: 1319.522 F-statistic : 1117.9338
Sigma-square : 0.216 Prob(F-statistic) : 0
S.E. of regression : 0.465 Log likelihood : -3987.446
Sigma-square ML : 0.216 Akaike info criterion : 7998.892
S.E of regression ML: 0.4647 Schwarz criterion : 8079.504
------------------------------------------------------------------------------------
Variable Coefficient Std.Error t-Statistic Probability
------------------------------------------------------------------------------------
CONSTANT 4.3796237 0.0169152 258.9162210 0.0000000
accommodates 0.0836436 0.0050786 16.4698200 0.0000000
bathrooms 0.1907912 0.0110047 17.3371724 0.0000000
bedrooms 0.1507462 0.0111794 13.4842887 0.0000000
beds -0.0414762 0.0069387 -5.9774814 0.0000000
rt_Private_room -0.5529958 0.0159599 -34.6490178 0.0000000
rt_Shared_room -1.2355206 0.0384618 -32.1232754 0.0000000
pg_Condominium 0.1404588 0.0222251 6.3198282 0.0000000
pg_House -0.0133019 0.0146230 -0.9096565 0.3630396
pg_Other 0.1411756 0.0227980 6.1924442 0.0000000
pg_Townhouse -0.0457839 0.0343557 -1.3326417 0.1826992
d2balboa 0.0016453 0.0009673 1.7008587 0.0890205
------------------------------------------------------------------------------------
REGRESSION DIAGNOSTICS
MULTICOLLINEARITY CONDITION NUMBER 12.745
TEST ON NORMALITY OF ERRORS
TEST DF VALUE PROB
Jarque-Bera 2 2710.322 0.0000
DIAGNOSTICS FOR HETEROSKEDASTICITY
RANDOM COEFFICIENTS
TEST DF VALUE PROB
Breusch-Pagan test 11 317.519 0.0000
Koenker-Bassett test 11 132.860 0.0000
================================ END OF REPORT =====================================
</pre></div>
</div>
</div>
</div>
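<p>To make the significance claim concrete, the t-statistic and p-value for <code class="docutils literal notranslate"><span class="pre">d2balboa</span></code> can be re-derived by hand from the coefficient and standard error printed in the summary. This is only a sketch: it uses a standard normal approximation to the t distribution, which we assume is reasonable here given the 6098 degrees of freedom.</p>

```python
import math

# Values copied from the d2balboa row of the summary table above
coef = 0.0016453
std_err = 0.0009673

# The t-statistic is the estimate divided by its standard error
t_stat = coef / std_err

# Two-sided p-value under a standard normal approximation
# (assumed adequate because the degrees of freedom are very large)
p_value = math.erfc(abs(t_stat) / math.sqrt(2))

print(round(t_stat, 3), round(p_value, 3))
```

<p>Both numbers agree with the reported values up to rounding, and the p-value sits above the 0.05 threshold, which is why the covariate is described as not significant at conventional levels.</p>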
<p>In addition, there still appears to be spatial structure in our model’s errors:</p>
<div class="cell docutils container">
<div class="cell_input docutils container">
<div class="highlight-ipython3 notranslate"><div class="highlight"><pre><span></span><span class="n">lag_residual</span> <span class="o">=</span> <span class="n">weights</span><span class="o">.</span><span class="n">spatial_lag</span><span class="o">.</span><span class="n">lag_spatial</span><span class="p">(</span><span class="n">knn</span><span class="p">,</span> <span class="n">m2</span><span class="o">.</span><span class="n">u</span><span class="p">)</span>
<span class="n">seaborn</span><span class="o">.</span><span class="n">regplot</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="n">m2</span><span class="o">.</span><span class="n">u</span><span class="o">.</span><span class="n">flatten</span><span class="p">(),</span> <span class="n">y</span><span class="o">=</span><span class="n">lag_residual</span><span class="o">.</span><span class="n">flatten</span><span class="p">(),</span>
<span class="n">line_kws</span><span class="o">=</span><span class="nb">dict</span><span class="p">(</span><span class="n">color</span><span class="o">=</span><span class="s1">'orangered'</span><span class="p">),</span>
<span class="n">ci</span><span class="o">=</span><span class="kc">None</span><span class="p">);</span>
</pre></div>
</div>
</div>
<div class="cell_output docutils container">
<img alt="../_images/11_regression_32_1.png" src="../_images/11_regression_32_1.png" />
</div>
</div>
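<p>The plot above regresses the spatial lag of the residuals on the residuals themselves. As a rough sketch of what that involves, the example below builds a row-standardized k-nearest-neighbor weights matrix by hand with NumPy, computes the lag as a matrix product, and shows that the slope of the fitted line coincides with Moran’s I when the residuals are centered. The coordinates and residuals are simulated toy data, not the Airbnb residuals, and all variable names are ours.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 200, 5

# Toy point locations and "residuals" with a smooth east-west trend
coords = rng.uniform(0, 1, size=(n, 2))
u = coords[:, 0] + 0.1 * rng.normal(size=n)
u = u - u.mean()  # center, as OLS residuals with a constant would be

# Pairwise distances; exclude each point from its own neighbor set
dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
np.fill_diagonal(dist, np.inf)
nbrs = np.argsort(dist, axis=1)[:, :k]

# Row-standardized kNN weights: each of the k neighbors gets weight 1/k
W = np.zeros((n, n))
W[np.arange(n)[:, None], nbrs] = 1.0 / k

# Spatial lag: the average residual among each point's neighbors
lag_u = W @ u

# Moran's I for row-standardized weights and centered values
morans_i = (u @ lag_u) / (u @ u)

# Slope of the regression line of the lag on the residuals
slope = np.polyfit(u, lag_u, 1)[0]

print(morans_i, slope)
```

<p>A clearly positive slope, as in the figure, therefore indicates positive spatial autocorrelation in the residuals: observations with high errors tend to be surrounded by neighbors with high errors.</p>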
<p>Finally, the distance to Balboa Park variable does not fit our theory about how distance to an amenity should affect the price of an Airbnb; the coefficient estimate is <em>positive</em>, meaning that people are paying a premium to be <em>farther</em> from the park. We will revisit this result later on, when we consider spatial heterogeneity, and will be able to shed some light on it. Further, the next chapter gives an extensive treatment of spatial fixed effects and presents many more spatial feature engineering methods. Here, we have only shown how to include these engineered features in a standard linear modelling framework.</p>
</div>
</div>
<div class="section" id="spatial-heterogeneity">
<h3>Spatial Heterogeneity<a class="headerlink" href="#spatial-heterogeneity" title="Permalink to this headline">¶</a></h3>
<p>So far, we’ve assumed that our proximity variable might stand in for a difficult-to-measure premium individuals pay when they’re close to a recreational zone. However, not all neighborhoods are created equal; some neighborhoods may be more lucrative than others, regardless of their proximity to Balboa Park. When this is the case, we need some way to account for the fact that each neighborhood may experience these kinds of <em>gestalt</em>, unique effects. One way to do this is by capturing <em>spatial heterogeneity</em>. At its most basic, <em>spatial heterogeneity</em> means that parts of the model may change in different places. For example, changes to the intercept, <span class="math notranslate nohighlight">\(\alpha\)</span>, may reflect the fact that different areas have different baseline exposures to a given process. Changes to the slope terms, <span class="math notranslate nohighlight">\(\beta\)</span>, may indicate some kind of geographical mediating factor, such as when a governmental policy is not consistently applied across jurisdictions. Finally, changes to the variance of the residuals, commonly denoted <span class="math notranslate nohighlight">\(\sigma^2\)</span>, can introduce spatial heteroskedasticity. We deal with the first two in this section.</p>
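<p>As a toy illustration of a spatially varying intercept, the sketch below simulates two hypothetical regions that share a slope but differ in their baseline level, then recovers both intercepts by including one indicator column per region in an ordinary least squares fit. The data, parameter values and variable names are all invented for the example, and the fit uses NumPy’s least squares solver rather than <code class="docutils literal notranslate"><span class="pre">spreg</span></code>.</p>

```python
import numpy as np

rng = np.random.default_rng(42)
n = 500

# Each observation belongs to region 0 or region 1 (hypothetical)
region = rng.integers(0, 2, size=n)
x = rng.normal(size=n)

# Region-specific intercepts, one slope shared by both regions
alpha = np.array([1.0, 2.0])
beta = 0.5
y = alpha[region] + beta * x + 0.1 * rng.normal(size=n)

# Design matrix: one indicator column per region (no global constant,
# which would be collinear with the indicators), plus the covariate
dummies = np.eye(2)[region]
X = np.column_stack([dummies, x])

coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coefs)  # estimated [alpha_0, alpha_1, beta]
```

<p>The first two estimates play the role of region-specific intercepts, while the third is the common slope. Letting the slope differ by region as well would require interacting the covariate with the indicators.</p>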
<p>To illustrate spatial fixed effects, let us return to the house price example from the previous section and use it as a more general illustration of “space as a proxy”. Given that we include only a limited set of explanatory variables in the model, it is likely we are missing some important factors that play a role in determining the price at which a house is sold. Some of them, however, are likely to vary systematically over space (e.g. different neighborhood characteristics). If that is the case, we can control for those unobserved factors by using traditional dummy variables but basing their creation on a spatial rule. For example, let us include a binary variable for every neighborhood, indicating whether a given house is located within that area (<code class="docutils literal notranslate"><span class="pre">1</span></code>) or not (<code class="docutils literal notranslate"><span class="pre">0</span></code>). Mathematically, we are now fitting the following equation:</p>
<div class="math notranslate nohighlight">
\[
\log{P_i} = \alpha_r + \sum_k \mathbf{X}_{ik}\beta_k + \epsilon_i