<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
<title>From Learning and Evolution to Data Science</title>
<link href="https://dracodoc.github.io/atom.xml" rel="self"/>
<link href="https://dracodoc.github.io/"/>
<updated>2019-11-16T00:56:50.778Z</updated>
<id>https://dracodoc.github.io/</id>
<author>
<name>dracodoc</name>
</author>
<generator uri="http://hexo.io/">Hexo</generator>
<entry>
<title>Reactive in Shiny</title>
<link href="https://dracodoc.github.io/2019/11/15/shiny-reactive/"/>
<id>https://dracodoc.github.io/2019/11/15/shiny-reactive/</id>
<published>2019-11-16T00:42:58.000Z</published>
<updated>2019-11-16T00:56:50.778Z</updated>
<content type="html"><![CDATA[<h2 id="intro"><a href="#Intro" class="headerlink" title="Intro"></a>Intro</h2><p> In preparing an invited talk on Shiny, I organized my experience and notes on reactive programming, and found the storyline I developed may actually be a good alternative compare to the usual tutorials on this topic. Thus I’m expanding the talk slides into a blog post and sharing it here.</p>
<a id="more"></a>
<h2 id="programming-for-user-interface-event-driving-programming"><a href="#Programming-for-User-Interface-Event-Driving-programming" class="headerlink" title="Programming for User Interface: Event Driving programming"></a>Programming for User Interface: Event Driving programming</h2><p> Programming user interface is different from some other domains, because user interface need to respond to user input and you don’t know when that will happen. Usually this means you write some logic for some possible situations, and there will be a maintained loop watching for user input, and trigger the appropriate logic when the input happens.</p>
<p> In desktop application development, the common pattern is Event-Driven programming. User input generates an event, and the event object has information about the input. You can write code for specific events and conditions, “register” the handler with the system (the programming framework), and the system will trigger the code. Here the framework handles the details of events, registering, and triggering, and the developer only needs to write the event handling code.</p>
<p> This pattern is straightforward and not hard to understand. Shiny supports this pattern too (<a href="https://shiny.rstudio.com/reference/shiny/1.4.0/observeEvent.html" target="_blank" rel="external">observeEvent</a>; note sometimes you may see code examples using <code>observe</code>, which is a low-level API, and I believe there is usually no real reason to use <code>observe</code> instead of the friendlier <code>observeEvent</code>), since it’s a good approach for certain use cases.</p>
<p> There is a slight difference in Shiny’s <code>observeEvent</code> though. You can think of it as observing data changes in the target, not really some event object (it’s possible that in the underlying implementation of the Shiny framework something could be called an event object, but I think this way of understanding helps to recognize the difference from, and connection to, the reactive programming topic later). For example, an <code>actionButton</code> click actually just increases its return value by 1, and that value change can trigger some observeEvent code. You can even write something like <code>observeEvent(1, {...})</code>; the code will just execute once and never again.</p>
<p> If we think of <code>observeEvent</code> as observing data changes, it can be triggered by any kind of change, including user input (which changes the value of input$widget_id) and reactive expressions (which we will discuss next).</p>
<p> Summary: <strong><code>observeEvent</code> observes data changes in the target expression and runs the code once anything changes</strong> (there are more options controlling the fine details, like whether to run at initialization, whether to ignore NULL, etc.; see the help page of <code>observeEvent</code>). </p>
<pre><code>observeEvent: data changes ---trigger---> event handling code
</code></pre><p> Note the official tutorials differentiate event observers and reactive expressions mainly by side effects vs. calculated values. In my experience this difference is less useful than the difference in the source/target of changes; the latter often determines which one you need to use, and you can have side effects in a reactive expression in some valid use cases. After all, anything interacting with the outside world is a side effect, and we need to interact with the outside world a lot in user interface programming.</p>
<p> If your reactive expression only returned some changed values and they were not reflected in the GUI, why were the changes needed? If they were reflected in the GUI, that’s still a side effect; the Shiny framework just did the plumbing work and made the changes, so the reactive expression didn’t look like it did anything imperative.</p>
<p> More relevantly, you should use the design principle of high cohesion and loose coupling: let related events update together. If you have multiple controls for one final value, it’s better to use a reactive expression instead of multiple observers.</p>
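<p> To make the pattern concrete, here is a minimal sketch (hypothetical widget ids, assuming a standard single-file Shiny app):</p>
<pre><code>library(shiny)

ui <- fluidPage(
  actionButton("go", "Go"),
  verbatimTextOutput("clicks")
)

server <- function(input, output, session) {
  # runs once every time input$go changes, i.e. on every click
  observeEvent(input$go, {
    message("clicked ", input$go, " times")  # a side effect
  })
  output$clicks <- renderPrint(input$go)
}

shinyApp(ui, server)
</code></pre>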
<h2 id="another-pattern-reactive-programming"><a href="#Another-pattern-Reactive-programming" class="headerlink" title="Another pattern: Reactive programming"></a>Another pattern: Reactive programming</h2><p> For more complete and detailed tutorial on reactive programming, check <a href="https://mastering-shiny.org/why-reactivity.html" target="_blank" rel="external">Hadley’s new book on Shiny</a>.</p>
<p> In this post my perspective is to introduce the reactive pattern by comparing it with event-driven programming.</p>
<p> <strong>A reactive expression/value will automatically update itself, triggered by data changes in its sources of change.</strong> This automatic update is handled by the Shiny framework, thus requires less manual work and appears more magical to developers.</p>
<h3 id="reactive-expression-all-reactive-values-inside-become-source-of-changes"><a href="#Reactive-Expression-all-reactive-values-inside-become-source-of-changes" class="headerlink" title="Reactive Expression: all reactive values inside become source of changes"></a>Reactive Expression: all reactive values inside become source of changes</h3><p> observeEvent is triggered by data changes in the target expression, while a <a href="https://shiny.rstudio.com/reference/shiny/1.4.0/reactive.html" target="_blank" rel="external">reactive expression</a> update is triggered by all data changes in all reactive values inside the expression, and you don’t need to register them explicitly. </p>
<pre><code>reactive({
...
Shiny UI reactive values like input$checkbox
reactive values defined by reactiveValue()
other reactive expression()
})
dynamic data 1
dynamic data 2 ==> expression reevaluate
dynamic data 3
</code></pre><p> Note:</p>
<ul>
<li>A reactive expression looks like a function and is used like a function. Thus you reference it with () to get the updated value, and pass it without () in some other scenarios (like Shiny modules) when you are using the expression itself but not going to use the updating value immediately.</li>
</ul>
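<p> A minimal sketch (hypothetical input ids): every reactive value read inside the expression becomes a source of changes, with no explicit registration.</p>
<pre><code># changing either input re-evaluates total
total <- reactive({
  input$price * input$quantity
})

output$summary <- renderText({
  paste("total:", total())  # call with () to read the current value
})
</code></pre>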
<h2 id="reactive-expression-vs-observeevent"><a href="#Reactive-Expression-Vs-observeEvent" class="headerlink" title="Reactive Expression Vs observeEvent"></a>Reactive Expression Vs observeEvent</h2><p> Compare to observeEvent, you can establish multiple -> one data update relationship in reactive expression without explicit registering, thus this is a prefered way if it met all your needs. </p>
<p> In <a href="(https://shiny.rstudio.com/reference/shiny/1.4.0/observe.html"><code>observe</code></a>) help page, there are some official comparison for these two, mainly focused on:</p>
<blockquote>
<p>it doesn’t yield a result and can’t be used as an input to other reactive expressions. Thus, observers are only useful for their side effects (for example, performing I/O).<br> Another contrast between reactive expressions and observers is their execution strategy. Reactive expressions use lazy evaluation; that is, when their dependencies change, they don’t re-execute right away but rather wait until they are called by someone else. Indeed, if they are not called then they will never re-execute. In contrast, observers use eager evaluation; as soon as their dependencies change, they schedule themselves to re-execute.</p>
</blockquote>
<p> All these are definitely valid points, but I think the deciding factor for choosing one of them should just be how you want to arrange the sources of changes, and eager vs. lazy evaluation. With observeEvent you need to be more explicit and have more control; with a reactive expression you “let it go” and everything will work smoothly if it fits the pattern.</p>
<h2 id="reactive-values"><a href="#Reactive-Values" class="headerlink" title="Reactive Values"></a>Reactive Values</h2><p> One real limit with reactive expression is that you cannot modify its value arbitrarily. It can update when source of changes changed, but always change with same expression. When you need to modify the dynamic data from another source/place/time, you need <a href="https://shiny.rstudio.com/reference/shiny/1.4.0/reactiveValues.html" target="_blank" rel="external">reactive values</a>.</p>
<p> Thus you have more control and more responsibility with reactive values.</p>
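<p> A minimal sketch (hypothetical names, assuming a standard Shiny server function), followed by a summary of the read/write rules:</p>
<pre><code>rv <- reactiveValues(count = 0)

# write from one place...
observeEvent(input$add, {
  rv$count <- rv$count + 1
})

# ...read from another; this re-renders whenever rv$count changes
output$count <- renderText(rv$count)
</code></pre>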
<pre><code>- read reactive value inside reactive expression
- value change ==> expression reevaluate
- write reactive value inside reactive expression
- expression reevaluate ==> value updated
- read/write same reactive value inside reactive expression?
- that will cause an infinite loop
</code></pre><h2 id="shiny-inputoutput-as-reactive-special-cases"><a href="#Shiny-input-output-as-reactive-special-cases" class="headerlink" title="Shiny input/output as reactive special cases"></a>Shiny input/output as reactive special cases</h2><ul>
<li>input values (input$slider_value) are reactive values driven by user input<ul>
<li>you cannot modify them directly by assignment</li>
<li>use <a href="https://shiny.rstudio.com/reference/shiny/1.4.0/" target="_blank" rel="external">update* methods</a> to change UI status</li>
</ul>
</li>
<li>output code (renderPlot) creates a reactive scope like a reactive expression<ul>
<li>its return value is used immediately</li>
<li>if you need to reuse the value, just create a reactive expression and reference it </li>
</ul>
</li>
<li>Error: Operation not allowed without an active reactive context<ul>
<li>Every reactive value inside a reactive context (like inside a reactive expression, or output code, which is a reactive context implicitly) gets registered by the Shiny framework behind the scenes so its changes can be monitored. Thus using a reactive value outside of a reactive context raises this error.</li>
<li>If you do need to inspect the value in debugging, or you want to read the value but don’t want the value update to trigger reactive expression reevaluation, you can use <a href="https://shiny.rstudio.com/reference/shiny/1.4.0/isolate.html" target="_blank" rel="external">isolate</a> (see the sketch after this list).</li>
</ul>
</li>
</ul>
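<p> A minimal sketch of <code>isolate</code> (hypothetical input ids): the plot re-renders when <code>input$go</code> changes, but not when <code>input$n</code> changes.</p>
<pre><code>output$plot <- renderPlot({
  input$go               # take a dependency on the button
  n <- isolate(input$n)  # read without taking a dependency
  hist(rnorm(n))
})
</code></pre>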
<h2 id="when-more-controls-are-needed"><a href="#When-more-controls-are-needed" class="headerlink" title="When more controls are needed"></a>When more controls are needed</h2><p> The components above can be used to create sophisticated dynamic systems. However sometimes the order of changes may not be ideal with these rules.</p>
<ul>
<li>One simple case is that your downstream reactive expression/value may not have a valid upstream value yet when the app UI is initialized. You can use <a href="https://shiny.rstudio.com/reference/shiny/1.4.0/req.html" target="_blank" rel="external">req</a> to hold off the related UI widget rendering until the upstream value is ready (see the sketch after this list). </li>
<li><p>Sometimes you have multiple widgets updating at the same time driven by some change, and one widget always updates more slowly; this may cause problems. </p>
<p>For example, <code>DT</code> is one of my favorite packages and I used it extensively in my app, often using the table selection to control other parts of the app. When a <code>DT</code> table is updated, the row selection information updates only after the whole table finishes rendering, which is often the slowest step if other widgets are updating at the same time. I may have a plot depending on some row selection value, so there will be a short period when the row selection value is not valid and the plot renders with the invalid value. Once the table finishes updating it will be corrected.</p>
<p>In the beginning I tried to use <a href="https://shiny.rstudio.com/reference/shiny/1.4.0/outputOptions.html" target="_blank" rel="external">priority levels</a> to adjust the order, but that never seemed to work.</p>
<p>Instead you can use <a href="https://shiny.rstudio.com/reference/shiny/1.4.0/freezeReactiveValue.html" target="_blank" rel="external">freezeReactiveValue</a>, which will hold off downstream changes until the last second, so the plot will not render with the invalid value.</p>
</li>
</ul>
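<p> A minimal sketch of <code>req</code> (hypothetical names; <code>my_data</code> is an assumed reactive): rendering is held off until the upstream value exists.</p>
<pre><code>output$plot <- renderPlot({
  req(input$table_rows_selected)  # stop quietly until a row is selected
  plot(my_data()[input$table_rows_selected, ])
})
</code></pre>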
]]></content>
<summary type="html">
<h2 id="Intro"><a href="#Intro" class="headerlink" title="Intro"></a>Intro</h2><p> In preparing an invited talk on Shiny, I organized my experience and notes on reactive programming, and found the storyline I developed may actually be a good alternative compare to the usual tutorials on this topic. Thus I’m expanding the talk slides into a blog post and sharing it here.</p>
</summary>
<category term="R" scheme="https://dracodoc.github.io/categories/R/"/>
<category term="R" scheme="https://dracodoc.github.io/tags/R/"/>
<category term="Shiny" scheme="https://dracodoc.github.io/tags/Shiny/"/>
</entry>
<entry>
<title>Make link button with Shiny functions</title>
<link href="https://dracodoc.github.io/2017/06/03/shiny-link-button/"/>
<id>https://dracodoc.github.io/2017/06/03/shiny-link-button/</id>
<published>2017-06-04T02:25:27.000Z</published>
<updated>2017-06-04T02:33:09.772Z</updated>
<content type="html"><![CDATA[<p>You can customize Shiny to a much greater extent if you knew Shiny UI functions just generate html codes. You can make a link button with creative use of Shiny functions.<br><a id="more"></a></p>
<p>RMarkdown is the better format for the content, so please see <a href="link-button/">the rendered RMarkdown document here</a>.</p>
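<p>As a minimal sketch of the idea (not the exact code from the rendered document): an <code>actionButton</code> is just a styled <code>&lt;button&gt;</code> tag, so an <code>&lt;a&gt;</code> tag with the same Bootstrap classes looks like a button but navigates like a link.</p>
<pre><code>library(shiny)
# a link styled with the same Bootstrap classes actionButton uses
link_button <- tags$a(
  href = "https://example.com",
  class = "btn btn-default",
  "Open site"
)
ui <- fluidPage(link_button)
</code></pre>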
]]></content>
<summary type="html">
<p>You can customize Shiny to a much greater extent once you know that Shiny UI functions just generate HTML code. You can make a link button with creative use of Shiny functions.<br>
</summary>
<category term="R" scheme="https://dracodoc.github.io/categories/R/"/>
<category term="R" scheme="https://dracodoc.github.io/tags/R/"/>
<category term="Shiny" scheme="https://dracodoc.github.io/tags/Shiny/"/>
</entry>
<entry>
<title>Color Sync in multiple ggplots</title>
<link href="https://dracodoc.github.io/2017/04/08/color-sync-gg/"/>
<id>https://dracodoc.github.io/2017/04/08/color-sync-gg/</id>
<published>2017-04-08T12:30:34.000Z</published>
<updated>2017-06-04T01:44:31.369Z</updated>
<content type="html"><![CDATA[<p>This is a summary about my experience on synchronize colors in multiple ggplots of same dataset. </p>
<a id="more"></a>
<p>RMarkdown is the better format for the content, so please see <a href="color_sync_ggplot/">the rendered RMarkdown document here</a>.</p>
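<p>The gist of the approach, as a minimal sketch (standard ggplot2; the rendered document has the full treatment): build a named color vector once and reuse it with a manual scale, so every plot maps the same level to the same color.</p>
<pre><code>library(ggplot2)
lv <- levels(factor(iris$Species))
# one fixed palette, named by factor level
pal <- setNames(scales::hue_pal()(length(lv)), lv)
p1 <- ggplot(iris, aes(Species, Sepal.Width, fill = Species)) +
  geom_boxplot() + scale_fill_manual(values = pal)
# a subset drops a level, but colors stay in sync
p2 <- ggplot(subset(iris, Species != "setosa"),
             aes(Species, Sepal.Width, fill = Species)) +
  geom_boxplot() + scale_fill_manual(values = pal)
</code></pre>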
]]></content>
<summary type="html">
<p>This is a summary of my experience synchronizing colors across multiple ggplots of the same dataset. </p>
</summary>
<category term="R" scheme="https://dracodoc.github.io/categories/R/"/>
<category term="R" scheme="https://dracodoc.github.io/tags/R/"/>
<category term="ggplot" scheme="https://dracodoc.github.io/tags/ggplot/"/>
</entry>
<entry>
<title>rCartoAPI - call Carto.com API with R</title>
<link href="https://dracodoc.github.io/2017/01/21/rCarto/"/>
<id>https://dracodoc.github.io/2017/01/21/rCarto/</id>
<published>2017-01-21T22:04:26.000Z</published>
<updated>2017-01-22T01:47:44.335Z</updated>
<content type="html"><![CDATA[<h2 id="summary"><a href="#Summary" class="headerlink" title="Summary"></a>Summary</h2><ul>
<li>My experience with Carto.com in creating web maps for data analysis</li>
<li>I wrote an R package to wrap Carto.com API calls</li>
<li>Some notes on my experience of managing gigabyte-size data for mapping</li>
</ul>
<a id="more"></a>
<h2 id="introduction"><a href="#Introduction" class="headerlink" title="Introduction"></a>Introduction</h2><p>Carto.com is a web map provider. I used Carto in my project because:</p>
<ol>
<li>With PostgreSQL and PostGIS as the backend, you have all the power of SQL and PostGIS functions. With Mapbox you would need to do everything in JavaScript. Because you can run SQL inside the Carto website UI, it’s much easier to experiment and update.</li>
<li>The new Builder lets users create widgets for a map, which let map viewers select a range in a date or histogram widget, or a value in a categorical variable, and the map will update dynamically. </li>
</ol>
<p>Carto provides <a href="https://carto.com/docs/carto-engine/sql-api" target="_blank" rel="external">several types of API</a> for different tasks. It’s simple to construct an API call with <code>curl</code>, but also very cumbersome. You also often need to use some parts of the response, which means a lot of copy/paste. I try to replace repetitive manual labor with programs as much as possible, so it was only natural to do this with R.</p>
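<p>For reference, a raw SQL API call looks roughly like this (a sketch with placeholder names, shown here with <code>httr</code>; the package wraps this pattern):</p>
<pre><code>library(httr)
account <- "your_user_name"  # placeholders
api_key <- "your_api_key"
res <- GET(
  sprintf("https://%s.carto.com/api/v2/sql", account),
  query = list(q = "SELECT count(*) FROM your_table", api_key = api_key)
)
content(res)  # parsed response
</code></pre>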
<p>There are some R packages and functions available for the Carto API, but they are either too old and broken or too limited for my usage. I gradually developed my own R functions for every API call I used, then made them into an R package - <a href="https://github.com/dracodoc/rCartoAPI" target="_blank" rel="external">rCartoAPI</a>. The package can:</p>
<ul>
<li>upload local file to Carto</li>
<li>let Carto import a remote file by url </li>
<li>let Carto sync with a remote file</li>
<li>check sync status</li>
<li>force sync</li>
<li>remove sync connection</li>
<li>list all sync tables</li>
<li>run a SQL query</li>
<li>run a time-consuming SQL query in batch mode and check its status later</li>
</ul>
<p>So it’s more focused on data import/sync and time-consuming SQL queries. It has saved me a lot of time.</p>
<h3 id="carto-user-name-and-api-key"><a href="#Carto-user-name-and-API-key" class="headerlink" title="Carto user name and API key"></a>Carto user name and API key</h3><p>All the functions in the package currently require an API key from Carto. Without an API key you can only do some read-only operations with public data. If there is more demand I can add keyless versions, though I think it would be even better for Carto to just provide an API key in the free plan.</p>
<p>It’s not easy to save sensitive information securely and conveniently at the same time. After checking <a href="http://blog.revolutionanalytics.com/2015/11/how-to-store-and-use-authentication-details-with-r.html" target="_blank" rel="external">this summary</a> and <a href="https://cran.r-project.org/web/packages/httr/vignettes/api-packages.html" target="_blank" rel="external">the best practices vignette</a> from <code>httr</code>, I chose to save them in environment variables and minimize the exposure of the user name and API key. After being read from the environment, the user name and API key only exist inside the package functions, which are further wrapped in the package environment, not visible from the global environment.</p>
<p>Most references I found for this usage used <code>.Rprofile</code>, while I think <code>.Renviron</code> is more suitable for this need. If you want to update variables and reload them, you don’t need to touch the other parts of <code>.Rprofile</code>. </p>
<p>When the package is loaded it checks the environment variables for the user name and API key and reports their status. If you modified the user name and API key in <code>.Renviron</code>, just run <code>update_env()</code>. </p>
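<p>The pattern looks like this (the variable names here are illustrative, not necessarily the exact ones the package expects):</p>
<pre><code># in ~/.Renviron, one key=value pair per line, no quotes needed:
#   carto_acc=your_user_name
#   carto_api_key=your_api_key

readRenviron("~/.Renviron")  # reload after editing
acc <- Sys.getenv("carto_acc")
key <- Sys.getenv("carto_api_key")
</code></pre>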
<h2 id="some-tips-from-my-experience"><a href="#Some-tips-from-my-experience" class="headerlink" title="Some tips from my experience"></a>Some tips from my experience</h2><h3 id="csv-column-type-guessing"><a href="#csv-column-type-guessing" class="headerlink" title="csv column type guessing"></a>csv column type guessing</h3><p>Carto by default will set csv column type according to column content. However sometimes column with numbers are actually categorical, and often there are leading 0s need to be kept. If Carto import these columns as number, the leading 0 information is lost and you cannot recover it by changing column type later in Carto. </p>
<p>Thus I add quotes around the columns that I want to keep as characters, and set the parameter <code>quoted_fields_guessing</code> to FALSE by default. Then Carto will not guess types for these columns. We still want field guessing on for other columns, especially since it is convenient that Carto recognizes a lon/lat pair and builds the geometry automatically. <code>write.csv</code> writes non-numeric columns with quotes by default, which is what we want. If you are using <code>fwrite</code> in <code>data.table</code>, you need to set <code>quote = TRUE</code> manually.</p>
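<p>A minimal sketch of the two writers (hypothetical data):</p>
<pre><code>library(data.table)
dt <- data.table(zip = c("00501", "02134"), value = c(1, 2))
# base R quotes character columns by default
write.csv(dt, "out_base.csv", row.names = FALSE)
# data.table needs quoting requested explicitly
fwrite(dt, "out_dt.csv", quote = TRUE)
</code></pre>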
<h3 id="update-data-after-a-map-is-created"><a href="#update-data-after-a-map-is-created" class="headerlink" title="update data after a map is created"></a>update data after a map is created</h3><p>Sometimes I may want to update the data used in a map after the map has been created, for example there are more data cleaning needed. I didn’t find a straightforward way to do this in Carto. </p>
<ul>
<li>One way is to upload the new data file with a new name, then duplicate the map and change the SQL call for the data set to load the new data table. There are multiple manual steps involved, and there will be duplicated maps and data sets.</li>
<li>Another way is to build the map on a table synced to a remote URL, for example a Dropbox shared file. Then you can update the file in Dropbox and let Carto update the data. If the default sync interval is too long, there is a <code>force_sync</code> function in the package to force an immediate sync. Note there is a 15-minute wait from the last sync before a force sync can work. </li>
</ul>
<p>It is also worth noting that copying a new version of the data file into the local Dropbox folder to overwrite the old version updates the file while keeping the sharing link the same.</p>
<h3 id="upload-large-file-to-carto"><a href="#upload-large-file-to-Carto" class="headerlink" title="upload large file to Carto"></a>upload large file to Carto</h3><p>There is a limit of 1 million rows for single file upload to Carto. I have a data file with 4 million rows, so I have to split it into smaller chunks, upload each file, then combine them with SQL inquries. With the help of <code>rdrop2</code> package and my own package, I can do all of these automatically, which make it much easier to update the data and run the process again.</p>
<p>Compare to upload huge local file directly to Carto, I think upload to cloud probably is more reliable. I chose dropbox because the direct file link can be inferred from the share link, while I didn’t find a working method to get direct link of google drive file. </p>
<p>To run the code below you need to provide a data set. Then the verification part may need some column adjustment to pass.</p>
<figure class="highlight r"><table><tr><td class="code"><pre><div class="line"><span class="keyword">library</span>(data.table)</div><div class="line"><span class="comment"># setup rdrop2</span></div><div class="line">devtools::install_github(<span class="string">'karthik/rdrop2'</span>)</div><div class="line"><span class="keyword">library</span>(rdrop2)</div><div class="line">drop_auth()</div><div class="line"><span class="comment"># provide your data set here</span></div><div class="line">target <- data.table(dataset)</div><div class="line"><span class="comment"># use small size to test workflow first, change to full scale later</span></div><div class="line">chunk_size <- <span class="number">200</span></div><div class="line">name_prefix <- <span class="string">"bfa_sample"</span></div><div class="line">file_count <- ceiling(target[, .N] / chunk_size)</div><div class="line"><span class="comment"># generate this to be used later. note no ".csv" part here</span></div><div class="line">file_name_list <- paste0(name_prefix, <span class="string">"_"</span>, <span class="number">1</span>:file_count)</div><div class="line"><span class="keyword">for</span> (i <span class="keyword">in</span> <span class="number">1</span>:file_count) {</div><div class="line"> range_s <- (i - <span class="number">1</span>) * chunk_size + <span class="number">1</span></div><div class="line"> <span class="comment"># the last chunk could be of different size. R will recycle rows if not specified</span></div><div class="line"> range_e <- min(target[, .N], range_s + chunk_size - <span class="number">1</span>)</div><div class="line"> save_csv(target[range_s:range_e], file_name_list[i])</div><div class="line">}</div><div class="line"><span class="comment"># verify split data integrity</span></div><div class="line">file_list <- paste0(csv_folder, file_name_list, <span class="string">".csv"</span>)</div><div class="line">dt_list <- vector(<span class="string">"list"</span>, length(file_list))</div><div class="line"><span class="keyword">for</span> (j <span class="keyword">in</span> seq_along(file_list)) {</div><div class="line"> dt_list[[j]] <- fread(file_list[[j]])</div><div class="line">}</div><div class="line">dt <- rbindlist(dt_list)</div><div class="line"><span class="comment"># in reality, some columns types need to be converted first after reading from csv directly</span></div><div class="line">all.equal(dt, target)</div><div class="line"><span class="comment"># setup dropbox, get url.</span></div><div class="line">file_urls <- vector(mode = <span class="string">"character"</span>, length = length(file_list))</div><div class="line"><span class="keyword">for</span> (i <span class="keyword">in</span> seq_along(file_list)) {</div><div class="line"> drop_upload(file_list[i])</div><div class="line"> res <- drop_share(drop_search(file_name_list[i])$path, short_url = <span class="literal">FALSE</span>)</div><div class="line"> file_urls[i] <- res$url</div><div class="line">}</div><div class="line"><span class="comment"># setup dropbox sync, wait complete, get table id</span></div><div class="line"><span class="keyword">for</span> (i <span class="keyword">in</span> seq_along(file_urls)) {</div><div class="line"> res <- url_sync(convert_dropbox_link(file_urls[i]))</div><div class="line">}</div><div class="line"><span class="comment"># check result</span></div><div class="line">tables_df <- list_sync_tables_df()</div></pre></td></tr></table></figure>
<p>My case needed uploading four 200 MB files. Any error in the network or the Carto server may prevent it from finishing perfectly. Upon checking the sync table I found the last file sync was not successful. I tried to force sync it but failed, so I just used this code to upload and sync that file again.</p>
<figure class="highlight r"><table><tr><td class="code"><pre><div class="line"><span class="comment"># need both file_path and file_name</span></div><div class="line">file_path <- <span class="string">"your file path"</span></div><div class="line">file_name <- <span class="string">"your file name"</span></div><div class="line">drop_upload(file_path)</div><div class="line">res <- drop_share(drop_search(file_name)$path, short_url = <span class="literal">FALSE</span>)</div><div class="line">file_url <- res$url</div><div class="line"><span class="comment"># setup dropbox sync, wait complete, get table id</span></div><div class="line">res <- url_sync(convert_dropbox_link(file_url))</div><div class="line">dt <- list_sync_tables_dt()</div></pre></td></tr></table></figure>
<h3 id="merge-uploaded-chunks-with-batch-sql"><a href="#merge-uploaded-chunks-with-Batch-sql" class="headerlink" title="merge uploaded chunks with Batch sql"></a>merge uploaded chunks with Batch sql</h3><p>With all data files uploaded to Carto, now we need to merge them. Because I tested with small size sample first, I can test my sql inquiry in the web page directly (click a data set to open the data view, switch to sql view to run sql inquiry). After that I run the sql inquiry with my R package. With everything works I change the data set to the full scale data and run the whole process again.</p>
<p>I used a template for sql inquiries because I need to apply them for small sample file first, then larger full scale file later. With a template I can change the table name easily.</p>
<p>Carto expect a table <a href="https://github.com/CartoDB/cartodb-postgresql/blob/master/doc/cartodbfy-requirements.rst" target="_blank" rel="external">matching some special schema to work</a>, including a <code>cartodb_id</code> column. When you upload a file into Carto, Carto will convert the data automatically in the importing process. Since we are creating a new table by sql API directly, this new table didn’t go through that process and is not ready for Carto mapping yet. We need to drop the <code>cartodb_id</code> column, <a href="https://github.com/CartoDB/cartodb/wiki/creating-tables-though-the-SQL-API" target="_blank" rel="external">run <code>cdb_cartodbfytable</code> function to make the table ready</a>. Only after this finished you can see the result table in the data set page of Carto.</p>
<p>The sql inquiries we used here need some time to finish. With rCartoAPI you can run the inquiries and check the job status easily.</p>
<figure class="highlight r"><table><tr><td class="code"><pre><div class="line"><span class="comment"># pattern of uploaded file name</span></div><div class="line">file_name_pattern <- <span class="string">"data_set"</span></div><div class="line">tables_dt <- list_sync_tables_dt()</div><div class="line"><span class="comment"># get the full table name for uploaded files in Carto</span></div><div class="line">file_name_list <- tables_dt[order(name)][str_detect(name, file_name_pattern), name]</div><div class="line">result_table <- <span class="string">"data_set_all"</span></div><div class="line"><span class="comment"># inquiries in two parts</span></div><div class="line">inquiry_list <- vector(mode = <span class="string">"character"</span>, length = <span class="number">2</span>)</div><div class="line"><span class="comment"># merge the table, cartodb_id column need to be dropped and generated again for merged dataset, because it is a row id column.</span></div><div class="line">inquiry_list[<span class="number">1</span>] <- <span class="string">"DROP TABLE IF EXISTS __result_table;</span></div><div class="line">CREATE TABLE __result_table AS </div><div class="line"> SELECT * FROM __table_1</div><div class="line"> union</div><div class="line"> SELECT * FROM __table_2</div><div class="line"> union</div><div class="line"> SELECT * FROM __table_3</div><div class="line"> union</div><div class="line"> SELECT * FROM __table_4</div><div class="line"> union</div><div class="line"> SELECT * FROM __table_5;</div><div class="line">ALTER TABLE __result_table</div><div class="line"> DROP COLUMN cartodb_id; "</div><div class="line"><span class="comment"># make a plain table ready for Carto. need your Carto user name here</span></div><div class="line">inquiry_list[<span class="number">2</span>] <- <span class="string">"select cdb_cartodbfytable('your user name', '__result_table')"</span></div><div class="line"><span class="comment"># str_replace_all named pair of pattern:replacement.</span></div><div class="line">inq <- lapply(inquiry_list, <span class="keyword">function</span>(x) str_replace_all(x, </div><div class="line"> c(<span class="string">"__result_table"</span> = result_table, </div><div class="line"> <span class="string">"__table_1"</span> = file_name_list[<span class="number">1</span>], </div><div class="line"> <span class="string">"__table_2"</span> = file_name_list[<span class="number">2</span>],</div><div class="line"> <span class="string">"__table_3"</span> = file_name_list[<span class="number">3</span>], </div><div class="line"> <span class="string">"__table_4"</span> = file_name_list[<span class="number">4</span>],</div><div class="line"> <span class="string">"__table_5"</span> = file_name_list[<span class="number">5</span>])))</div><div class="line"><span class="comment"># run batch job 1, merge tables</span></div><div class="line">job <- sql_batch_inquiry_id(inq[[<span class="number">1</span>]])</div><div class="line">sql_batch_check(job)</div><div class="line"><span class="comment"># check merging result</span></div><div class="line">sql_inquiry_dt(<span class="string">"select * from data_set_all limit 2"</span>)</div><div class="line">sql_inquiry_dt(<span class="string">"select count(*) from data_set_all"</span>)</div><div class="line"><span class="comment"># run batch job 2, cartodbfy</span></div><div class="line">job_2 <- sql_batch_inquiry_id(inq[[<span class="number">2</span>]])</div><div class="line">sql_batch_check(job_2)</div><div class="line"><span class="comment"># check 
result</span></div><div class="line">sql_inquiry_dt(<span class="string">"select * from data_set_all limit 2"</span>)</div></pre></td></tr></table></figure>
<p>After this I could create a map with the merged data set. However, the map performance was not ideal. I learned that you can <a href="https://carto.com/docs/tips-and-tricks/back-end-data-performance" target="_blank" rel="external">create overviews to improve performance</a> in this case.</p>
<p>So I dropped the overviews for the uploaded chunks, which were created automatically in the importing process but are not needed, then created an overview for the merged table.</p>
<figure class="highlight r"><table><tr><td class="code"><pre><div class="line"><span class="comment"># optimization for big table</span></div><div class="line">sql_inquiry(<span class="string">"select cdb_dropoverviews('table_1'); "</span>)</div><div class="line">sql_inquiry(<span class="string">"select cdb_dropoverviews('table_2'); "</span>)</div><div class="line">sql_inquiry(<span class="string">"select cdb_dropoverviews('table_3'); "</span>)</div><div class="line">sql_inquiry(<span class="string">"select cdb_dropoverviews('table_4'); "</span>)</div><div class="line">sql_inquiry(<span class="string">"select cdb_dropoverviews('table_5'); "</span>)</div><div class="line">job_4 <- sql_batch_inquiry_id(<span class="string">"select cdb_createoverviews('data_set_all'); "</span>)</div><div class="line">sql_batch_check(job_4)</div><div class="line"></div></pre></td></tr></table></figure>
<p>Later I found I wanted to add a year column that works as categorical instead of numerical. Even this simple process is very slow for a table this large, so I had to use a batch SQL query for it. I also needed to update the overview for the table after this change to the data.</p>
<figure class="highlight r"><table><tr><td class="code"><pre><div class="line"><span class="comment"># add year categorical</span></div><div class="line">job_5 <- sql_batch_inquiry_id(<span class="string">"alter table data_set_all</span></div><div class="line"> add column by_year varchar(25)")</div><div class="line">sql_batch_check(job_5)</div><div class="line">job_6 <- sql_batch_inquiry_id(<span class="string">"update data_set_all</span></div><div class="line"> set by_year = to_char(year, '9999')")</div><div class="line">sql_batch_check(job_6)</div><div class="line"><span class="comment"># run overview again</span></div><div class="line">sql_inquiry(<span class="string">"select cdb_dropoverviews('data_set_all'); "</span>)</div><div class="line">job_7 <- sql_batch_inquiry_id(<span class="string">"select cdb_createoverviews('data_set_all'); "</span>)</div><div class="line">sql_batch_check(job_7)</div></pre></td></tr></table></figure>]]></content>
<summary type="html">
<h2 id="Summary"><a href="#Summary" class="headerlink" title="Summary"></a>Summary</h2><ul>
<li>My experience with Carto.com in creating web maps for data analysis</li>
<li>I wrote an R package to wrap Carto.com API calls</li>
<li>Some notes on my experience of managing gigabyte-size data for mapping</li>
</ul>
</summary>
<category term="R" scheme="https://dracodoc.github.io/categories/R/"/>
<category term="R" scheme="https://dracodoc.github.io/tags/R/"/>
<category term="Data Science" scheme="https://dracodoc.github.io/tags/Data-Science/"/>
<category term="Map" scheme="https://dracodoc.github.io/tags/Map/"/>
<category term="Carto" scheme="https://dracodoc.github.io/tags/Carto/"/>
</entry>
<entry>
<title>RStudio addin - extend RStudio in your way</title>
<link href="https://dracodoc.github.io/2016/08/10/rstudio-addin/"/>
<id>https://dracodoc.github.io/2016/08/10/rstudio-addin/</id>
<published>2016-08-10T17:57:23.000Z</published>
<updated>2016-08-30T17:55:16.329Z</updated>
<content type="html"><![CDATA[<h2 id="rstudio-addins-first-attempt"><a href="#RStudio-addins-first-attempt" class="headerlink" title="RStudio addins - first attempt"></a>RStudio addins - first attempt</h2><p>Recently I found RStudio began to provide addin mechanism. The examples looked simple, and the addin API easy to use. I immediately started to try writing one by myself. It will be a good practice project for writing R package, and I can implement some features I wanted but not in RStudio’s high priority list.</p>
<a id="more"></a>
<p>My first idea came from a long-time frustration with using <code>Ctrl+Enter</code> to run the current statement in the console. With ggplot code like this, <code>Ctrl+Enter</code> only sends the single line at the cursor.</p>
<pre><code>ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = cut), width = 1) +
coord_polar() +
facet_wrap( ~ clarity)
</code></pre><p>I submitted a feature request for this to RStudio support, though I didn’t expect it to be implemented soon since they must have lots of stuff on their list. </p>
<p>After a little research on how R recognizes a multi-line statement as a single statement, I felt the problem was not easy but doable. </p>
<p>R knows a statement is not finished even at a newline if it finds</p>
<ul>
<li>a string started with a quotation mark that is not closed yet</li>
<li>an operator like <code>+</code>, <code>/</code>, <code><-</code> at the end of the line</li>
<li>a function call started with <code>(</code> that is not closed yet</li>
</ul>
<p>I started to write regular expressions and work on the addin mechanism. After some time I began to test on sample code, and then I found RStudio could already send multi-line statements with <code>Ctrl+Enter</code> correctly! </p>
<p>It turned out I had just upgraded RStudio to the latest preview version because addin development required it, and the latest preview version had already implemented my feature suggestion. I knew it could be easy from RStudio’s angle, because RStudio has analyzed every line of code and should have much of the information readily available.</p>
<h2 id="mischelper"><a href="#mischelper" class="headerlink" title="mischelper"></a>mischelper</h2><p>With my initial target crossed off, I looked for other use cases that could use an addin. </p>
<ul>
<li><p>The first candidate came from my experience of copying text from PDFs as notes: I’d like to <code>remove the hard line breaks</code> from the PDF text. To do this I needed to separate the hard word wrap from the normal paragraph breaks. With some experimentation on regular expressions this was done in a short time. I also added an option to insert an empty line between paragraphs.</p>
<p> <img src="unwrap.gif" alt="unwrap"></p>
</li>
<li><p>I felt the <code>remove hard line break</code> feature was too trivial to be an independent addin, so I added yet another trivial feature: flip the Windows path separator <code>\</code> into <code>/</code>. Thus I can copy a file or folder’s full path in Total Commander and paste it into an R script with one click.</p>
<p> <img src="flip.gif" alt="flip"></p>
</li>
<li><p>Still not satisfied, I later found a really useful function: if you want to do a simple benchmark or measure the time spent on code, the primitive method is to use <code>proc.time()</code>. Or you could use the great <a href="https://cran.r-project.org/web/packages/microbenchmark/index.html" target="_blank" rel="external"><code>microbenchmark</code></a> package, which runs the code several times to get better statistics.<br> To use <code>microbenchmark</code>, you need to wrap your code or function like this:</p>
<pre><code>microbenchmark::microbenchmark({your code or function}, times = 100)
</code></pre><p> It’s not hard if you are just measuring a function, but I found that most of the time I wanted to measure a code chunk instead of a function. Because it’s harder to interactively debug code once it is wrapped into a function, I always fully test code before it becomes a function. Sometimes I may also want to test different code chunks, so the usage of <code>microbenchmark</code> became quite laborious.</p>
<p> I always want to automate everything as much as I can, and this case is a perfect fit. Just select the code to benchmark, and one keyboard shortcut or menu click will wrap it and run microbenchmark in the console (a sketch of how an addin reads the selection follows this list). Since the code in the source editor is not changed, I can continue coding or select a different code chunk freely without any extra editing.</p>
<p> <img src="benchmark.gif" alt="microbenchmark"></p>
</li>
<li><p>In a similar spirit, I wrote another function to use the profiler provided by RStudio. </p>
</li>
</ul>
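<p>A minimal sketch of the selection-wrapping idea (not mischelper’s exact code), using the rstudioapi package:</p>
<pre><code>library(rstudioapi)
# grab the text currently selected in the source editor
context <- getActiveDocumentContext()
selected <- context$selection[[1]]$text
# wrap it and run in the console, leaving the source untouched
code <- sprintf("microbenchmark::microbenchmark({\n%s\n}, times = 100)",
                selected)
sendToConsole(code, execute = TRUE)
</code></pre>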
<p>Now my addin has enough features, and I named it <a href="https://github.com/dracodoc/mischelper" target="_blank" rel="external"><code>mischelper</code></a> since the features are quite random. I’m not sure if end users will need all of them. Installing the addin adds 5 menu items to the addin menu, and the menu can become quite busy quickly. There is no menu organization mechanism like menu folders available yet, though you can edit the menu registration file manually to remove the features you don’t need from the list.</p>
<h2 id="namebrowser"><a href="#namebrowser" class="headerlink" title="namebrowser"></a>namebrowser</h2><p>The features I developed above are very simple, though another idea I had turned out to be much more complicated.</p>
<p>The motivation came from my experience of learning R packages. There are thousands of R packages and you do need to use quite a few of them. Sometimes I knew a method or dataset existed but was not sure which package it is in, especially when there are several related candidates, like <code>plyr</code>, <code>dplyr</code>, <code>tidyr</code>, etc. R help suggests using <code>??</code> when it cannot find a name, but <code>??</code> seems to be a full-text search, which is slow and returns too many irrelevant results.</p>
<p>I used to code Java in IntelliJ IDEA. One feature called <code>auto import</code> can:</p>
<ol>
<li>Automatically add import statements for all classes that are found in the pasted block of code and are not imported in the current class yet</li>
<li>Automatically display an import pop-up dialog box when typing the name of a symbol that lacks an import statement.</li>
</ol>
<p>I made a <a href="https://support.rstudio.com/hc/en-us/community/posts/212206388-automatically-load-packages-like-the-auto-import-in-IntelliJ-IDEA" target="_blank" rel="external">feature request</a> to RStudio again, though after some research I found this task is not an easy one. In Java there is probably not much ambiguity about which class to import since the names are often unique, while in R many functions share the same names across packages. Users have to check the options and make a decision, so it’s impossible to load packages automatically. The only solution is to provide a name database browser to check and search names.</p>
<p>It takes quite some tedious work to maintain a database of names in packages, especially since the installed packages can change, upgrade, or be removed from time to time. The method I tested needed to load and attach each package before scanning, which hits the error <code>maximal number of DLLs reached</code> pretty soon. I made extra efforts to unload packages properly after scanning, but there would still be some packages that cannot be unloaded because of dependencies from other loaded packages. Finally I built up a workflow to scan hundreds of packages, then started to work on a browser to search the name table.</p>
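<p>A minimal sketch of the scanning idea (not namebrowser’s actual implementation; loading a namespace is lighter than attaching, but compiled packages still count against the DLL limit):</p>
<pre><code># collect exported names from every installed package
pkgs <- rownames(installed.packages())
name_list <- lapply(pkgs, function(pkg) {
  exports <- tryCatch(getNamespaceExports(pkg),
                      error = function(e) character(0))
  if (length(exports)) data.frame(package = pkg, name = exports)
})
name_table <- do.call(rbind, name_list)
</code></pre>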
<p>With Shiny and DT it is relatively easy to get a working prototype running, though any special customization I wanted took lots of effort to search, read, and experiment on every little piece of information. After a lot of revisions I finally got <a href="https://github.com/dracodoc/namebrowser" target="_blank" rel="external">a satisfying version here</a>.</p>
<p><img src="search_normal_prefix.gif" alt="search_normal_prefix"></p>
<p><img src="search_regex_lib.gif" alt="search_regex_lib"></p>
<p><img src="search_symbol.gif" alt="search_symbol"></p>
<h2 id="addin-list"><a href="#addin-list" class="headerlink" title="addin list"></a>addin list</h2><p>I think RStudio addin is a great method to allow users to add features into RStudio based on their own needs. Although it’s still in its infancy stage, there are many good addins popped up already. You can check out <a href="https://github.com/daattali/addinslist" target="_blank" rel="external">addinlist</a>, which listed most known addins. You can also install it as a RStudio addin to manage addin installation. Some addins look very promising, like the <a href="https://github.com/daattali/addinslist" target="_blank" rel="external">ggplot theme assist</a>, which let you customize ggplot2 themes interactively.</p>
]]></content>
<summary type="html">
<h2 id="RStudio-addins-first-attempt"><a href="#RStudio-addins-first-attempt" class="headerlink" title="RStudio addins - first attempt"></a>RStudio addins - first attempt</h2><p>Recently I found RStudio began to provide addin mechanism. The examples looked simple, and the addin API easy to use. I immediately started to try writing one by myself. It will be a good practice project for writing R package, and I can implement some features I wanted but not in RStudio’s high priority list.</p>
</summary>
<category term="R" scheme="https://dracodoc.github.io/categories/R/"/>
<category term="R" scheme="https://dracodoc.github.io/tags/R/"/>
<category term="Data Science" scheme="https://dracodoc.github.io/tags/Data-Science/"/>
<category term="RStudio" scheme="https://dracodoc.github.io/tags/RStudio/"/>
</entry>
<entry>
<title>Data Cleaning Part 2 - Geocoding Addresses, Double The Performance By Cleaning</title>
<link href="https://dracodoc.github.io/2016/02/03/data-cleaning-geocode/"/>
<id>https://dracodoc.github.io/2016/02/03/data-cleaning-geocode/</id>
<published>2016-02-03T21:17:59.000Z</published>
<updated>2016-08-19T13:55:46.098Z</updated>
<content type="html"><![CDATA[<h2 id="summary"><a href="#Summary" class="headerlink" title="Summary"></a>Summary</h2><ul>
<li>This is my second post on the topic of Data Cleaning. </li>
<li>Cleaning the address format turned out to have a substantial positive impact on geocoding performance.</li>
<li>A deep understanding of the address format standard is needed to deal with all kinds of special cases.</li>
</ul>
<a id="more"></a>
<h2 id="introduction"><a href="#Introduction" class="headerlink" title="Introduction"></a>Introduction</h2><p>I discussed a lot of interesting findings I discovered in NYC Taxi Trip data in <a href="dracodoc.github.io/2016/01/31/data-cleaning/">last post</a>. However it was not clear whether the cleaning added much value to the analysis other than some anomaly records were removed, and you can always check the outliers for any calculation and remove them when appropriate.</p>
<p>Actually there are some times that the data cleaning can have great benefits. I was <a href="http://dracodoc.github.io/2015/11/17/Geocoding/">geocoding lots of addresses from public data</a> recently, and found cleaning the addresses almost doubled the geocoding performance. This effect is not really mentioned anywhere as far as I know, and I only have a theory about how that is possible.</p>
<p>In short, I was feeding address strings to PostGIS Tiger Geocoder extension for geocoding.</p>
<p><img src="http://dracodoc.github.io/2015/11/19/Script-workflow/NFIRS_data_sample.png" alt="address format"></p>
<h2 id="clean-addresses-have-much-better-geocoding-performance"><a href="#Clean-Addresses-Have-Much-Better-Geocoding-Performance" class="headerlink" title="Clean Addresses Have Much Better Geocoding Performance"></a>Clean Addresses Have Much Better Geocoding Performance</h2><p>Simple assembling the columns could have lots of dirty inputs which will interfere with the Geocoder parsing. I first did one pass Geocoding on 2010 data, then checked the geocoding results. I filtered many type of dirty inputs that caused problems and cleaned them up. Using the cleaning routine on other years’ data, the geocoding performance doubled. </p>
<table>
<thead>
<tr>
<th style="text-align:left">NFIRS Data Year</th>
<th style="text-align:left">Addresses Count</th>
<th>Time Used</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:left">2009</td>
<td style="text-align:left">1,767,797</td>
<td>6.3 days</td>
</tr>
<tr>
<td style="text-align:left">2010</td>
<td style="text-align:left">1,829,731</td>
<td>14.28 days</td>
</tr>
<tr>
<td style="text-align:left">2011</td>
<td style="text-align:left">1,980,622</td>
<td>7.06 days</td>
</tr>
<tr>
<td style="text-align:left">2012</td>
<td style="text-align:left">1,843,434</td>
<td>6.57 days</td>
</tr>
<tr>
<td style="text-align:left">2013</td>
<td style="text-align:left">1,753,145</td>
<td>6.51 days</td>
</tr>
</tbody>
</table>
<p>I didn’t find anybody mentioning this kind of performance gain in my thorough research on geocoding performance tuning. Somebody suggested normalizing addresses first, but that doesn’t help performance because the Geocoder will normalize the address input anyway, unless your normalizing procedure is vastly better than the built-in normalizer. My theory about this performance gain is as follows:</p>
<ol>
<li>The PostgreSQL PostGIS server will try to cache all the data needed for geocoding in RAM. My geocoding server can hold 1 ~ 2 states’ data in RAM, so I split the input addresses by state. Every input file is single-state only. Ideally the server will not need to read from disk most of the time.</li>
<li>The problem is that there are lots of addresses with a wrong zip code or city. The Geocoder can still process them, but it will be much slower because it needs to scan a much broader range. It seems it will scan all states even if the state information is correct. I didn’t find a way to limit the scan range to a known state, and this was confirmed by the Geocoder author.</li>
<li>The problematic addresses are scattered in the input file. Every time the Geocoder meets one, it scans all states and messes up the perfect cache, which causes a big performance drop on the good addresses that follow.</li>
<li>With the cleaning procedure in use, the bad addresses are either removed from the input or collected into a special input file, separated from the good addresses. Now the Geocoder can process the good addresses much faster.</li>
</ol>
<h2 id="all-the-format-errors"><a href="#All-the-format-errors" class="headerlink" title="All the format errors"></a>All the format errors</h2><p>Here are the cleaning procedures I used. In the end I filtered and cleaned about 14% of data in many types.</p>
<figure class="highlight r"><table><tr><td class="code"><pre><div class="line"><span class="comment"># loading data and preparing address string</span></div><div class="line">data_year = <span class="string">'2010'</span></div><div class="line"><span class="comment"># create year directory, load original address data, change year number here.</span></div><div class="line">load(paste0(<span class="string">'data/address/'</span>, data_year, <span class="string">'_formated_addresses.Rdata'</span>)) </div><div class="line">setnames(address,<span class="string">'ZIP5'</span>, <span class="string">'zip'</span>)</div><div class="line">address[, row_seq := as.numeric(row.names(address))]</div><div class="line">setkey(address, zip)</div><div class="line">address[, address_type := <span class="string">'a'</span>] <span class="comment"># type 1, 3,4,5 as addresses can be geocoded.</span></div><div class="line">address[LOC_TYPE == <span class="string">'2'</span>, address_type := <span class="string">'i'</span>] <span class="comment"># to be combined with intersections in type 1 as intersections input</span></div><div class="line">address[LOC_TYPE %<span class="keyword">in</span>% c(<span class="string">'6'</span>, <span class="string">'7'</span>), address_type := <span class="string">'n'</span>] <span class="comment"># ignore 6,7</span></div><div class="line"><span class="comment"># original reference, change input string instead of original fields if possible</span></div><div class="line">address[, original_address :=</div><div class="line"> paste0(NUM_MILE,<span class="string">' '</span>, STREET_PRE,<span class="string">' '</span>, STREETNAME,<span class="string">' '</span>, STREETTYPE, </div><div class="line"> <span class="string">' '</span>, STREETSUF, <span class="string">' '</span>,APT_NO, <span class="string">', '</span>, CITY, <span class="string">', '</span>, STATE_ID, <span class="string">' '</span>, zip)] </div></pre></td></tr></table></figure>
<p>There are many manually entered placeholder symbols for NA:</p>
<pre><code>> head(str_subset(address$original_address, "N/A"))
[1] "55 Margaret ST N/A, Monson, MA 01057" "55 Margaret ST N/A, Monson, MA 01057"
[3] "1657 WORCESTER RD N/A, FRAMINGHAM, MA 01701" "132 UNION AV N/A, FRAMINGHAM, MA 01702"
[5] "N/A OAKLAND BEACH AV , Warwick, RI 02889" "00601 MERRITT 7 N/A , NORWALK, CT 06850"
> head(str_subset(address$original_address, "null"))
[1] "96 Walworth ST null, Saratoga Springs, NY 12866" "197 S Broadway null, Saratoga Springs, NY 12866"
[3] "640 West Broadway , Conconully, WA 98819" "58 W Fork Rd , Conconully, WA 98819"
[5] " Mineral Hill Rd , Conconully, WA 98819" "225 Conconully ST , OKANOGAN, WA 98840"
</code></pre><p>Because ‘NA’ or ‘na’ could be a valid part of an address string, it’s better to clean them up before concatenating the fields into one address string.</p>
<pre><code>> head(str_subset(address$original_address, "NA"))
[1] "7821 W CINNABAR AV , PEORIA, AZ 00000" "7818 W PINNACLE PEAK RD , PEORIA, AZ 00000"
[3] "8828 W SANNA ST , PEORIA, AZ 00000" "8221 W DEANNA DR , PEORIA, AZ 00000"
[5] "2026 W NANCY LN , PHOENIX, AZ 00000" "3548 E HELENA DR , PHOENIX, AZ 00000"
</code></pre><p>Once the field-level cleaning was finished, I prepared a cleaner address string and did all further cleaning on that concatenated string. That’s why I concatenated all original fields into <code>original_address</code>: it serves as a reference in case some fields change in a later step.</p>
<p>Most of the other cleaning steps are better done on the whole string, because some input may land in the wrong field, like a street number entered in the street name column instead of the street number column. With the whole string this kind of error doesn’t matter any more.</p>
<figure class="highlight r"><table><tr><td class="code"><pre><div class="line"><span class="comment"># remove all kinds of input for NA</span></div><div class="line">str_subset(address$original_address, <span class="string">"N/A"</span>)</div><div class="line"><span class="keyword">for</span> (j <span class="keyword">in</span> seq_len(ncol(address)))</div><div class="line"> set(address,which(is.na(address[[j]]) | </div><div class="line"> (address[[j]] %<span class="keyword">in</span>% c(<span class="string">'N/A'</span>,<span class="string">'n/a'</span>, <span class="string">'NA'</span>,<span class="string">'na'</span>, <span class="string">'NULL'</span>, <span class="string">'null'</span>))),j,<span class="string">''</span>) </div></pre></td></tr></table></figure>
<p>Many addresses’ zip codes are wrong.</p>
<pre><code>> sample(address[!grep('\\d\\d\\d\\d\\d', zip), zip], 20)
[1] "" "06" "" "" "625" "021" "33" "021" "461" "" "021" "2008" "970" "" "11" "021" "021"
[18] "9177" "" "021"
</code></pre><p>The Geocoder can process addresses without a zip code, but the field has to be formatted like ‘00000’.</p>
<figure class="highlight r"><table><tr><td class="code"><pre><div class="line"><span class="comment"># ---- some zip are invalid ----</span></div><div class="line">address[!grep(<span class="string">'\\d\\d\\d\\d\\d'</span>, zip), <span class="string">':='</span> (zip = <span class="string">'00000'</span>, address_type = <span class="string">'az'</span>)] </div></pre></td></tr></table></figure>
<p>After the above two steps of directly modifying the address fields, I prepared the address string; all later cleaning works on the whole string.</p>
<figure class="highlight r"><table><tr><td class="code"><pre><div class="line"><span class="comment"># ---- prepare address string (ignore apt_no) ---- </span></div><div class="line">address[, input_address :=</div><div class="line"> paste0(NUM_MILE,<span class="string">' '</span>, STREET_PRE,<span class="string">' '</span>, STREETNAME,<span class="string">' '</span>, STREETTYPE, </div><div class="line"> <span class="string">' '</span>, STREETSUF, <span class="string">' '</span>, <span class="string">', '</span>, CITY, <span class="string">', '</span>, STATE_ID, <span class="string">' '</span>, zip)] </div><div class="line">address[, input_address := str_trim(gsub(<span class="string">"\\s+"</span>,<span class="string">" "</span>,input_address))]</div></pre></td></tr></table></figure>
<p>Some addresses are empty. </p>
<pre><code>> head(address[STATE_ID == '' & STREETNAME == '', original_address])
[1] " , , " " , , " " , , " " , , " " , , " " , , "
</code></pre><figure class="highlight r"><table><tr><td class="code"><pre><div class="line"><span class="comment"># ---- ignore empty rows, most with empty state and zip ---- </span></div><div class="line"><span class="comment"># may came from duplicate records for same event from 2 dept</span></div><div class="line">address[STATE_ID == <span class="string">''</span> & STREETNAME == <span class="string">''</span>, address_type := <span class="string">'e'</span>] </div></pre></td></tr></table></figure>
<p>Special symbols like <code>/</code>, <code>@</code>, <code>&</code>, <code>*</code> are used in all kinds of ways in the input, and they interfere with the Geocoder.</p>
<pre><code>> sample(address[LOC_TYPE == '1' & str_detect(address$input_address, "[/|@|&]"), input_address], 10)
[1] "743 CHENANGO ST , BINGHAMTON/FENTON, NY 13901" "123/127 tennyson , highland park, MI 48203"
[3] "318 1/2 McMILLEN ST , Johnstown, PA 15902" "712 1/2 BURNSIDE DR , GARDEN CITY, KS 67846"
[5] "m/m143 W Interstate 16 , Ellabell, GA 31308" "12538 Greensbrook Forest DR , Houston / Sheldon, TX 77044"
[7] "F/O 1179 CASTLEHILL AVE , New York City, NY 10462" "509 1/2 N Court , Ottumwa, IA 52501"
[9] "7945 Larson , Hereford/Palominas, AZ 85615" "1022 1/2 N Langdon ST , MITCHELL, SD 57301"
</code></pre><p>First I removed all the <code>1/2</code> fractions, since the Geocoder cannot recognize them and removing them does not affect the accuracy of the Geocoding result.</p>
<figure class="highlight r"><table><tr><td class="code"><pre><div class="line">address[str_detect(address$input_address, <span class="string">"1/2"</span>), </div><div class="line"> input_address := str_replace_all(input_address, <span class="string">"1/2"</span>, <span class="string">""</span>)]</div></pre></td></tr></table></figure>
<p>Some used <code>*</code> to label intersections, which I will process later with a different Geocoding script.</p>
<pre><code>> head(address[str_detect(input_address, "[a-zA-Z]\\*[a-zA-Z]"), input_address])
[1] "16 MC*COOK PL , East Lyme, CT 06333" "1236 WAL*MART PLZ , PHILLIPSBURG, NJ 08865"
[3] "0 GREENSPRING AV*JFX , BROOKLANDVILLE, MD 21022" "0 BELFAST RD*SHAWAN RD , COCKEYSVILLE, MD 21030"
[5] "0 SHAWAN RD*WARREN RD , COCKEYSVILLE, MD 21030" "0 SHAWAN RD*BELFAST RD , COCKEYSVILLE, MD 21030"
</code></pre><figure class="highlight r"><table><tr><td class="code"><pre><div class="line">address[str_detect(input_address, <span class="string">"[a-zA-Z]\\*[a-zA-Z]"</span>), address_type := <span class="string">'i_*'</span>]</div></pre></td></tr></table></figure>
<p>Similarly, flag addresses containing the other special symbols as intersection-style input.</p>
<figure class="highlight r"><table><tr><td class="code"><pre><div class="line">address[address_type == <span class="string">'a'</span> & str_detect(address$input_address, <span class="string">"[/|@|&]"</span>),</div><div class="line"> address_type := <span class="string">'i_/@&'</span>]</div></pre></td></tr></table></figure>
<p>Many addresses used milepost numbers, which count miles along a highway. They are not street addresses and cannot be processed by the Geocoder. This type of address is recorded in all kinds of formats.</p>
<pre><code>> head(str_subset(address$input_address, "(?i)milepost"))
[1] "452.2E NYS Thruway Milepost , Angola, NY 14006" "447.4W NYS Thruway Milepost , Angola, NY 14006"
[3] "446W NYS Thruway Milepost , Angola, NY 14006" "447.4 NYS Thruway Milepost , Angola, NY 14006"
[5] "444.1W NYS Thruway Milepost , Angola, NY 14006" "I-94 MILEPOST 68 , Eau Claire, WI 54701"
> head(str_subset(address$input_address, "\\bmile\\b|\\bmiles\\b"))
[1] "2.5 mile Schillinger RD , T8R3 NBPP, ME 00000" "cr 103(2 miles west of 717) , breckenridge, TX 00000"
[3] "Interstate 93 south mile mark , WINDHAM, NH 03087" "183 lost mile rd. , parsonfield, ME 04047"
[5] "168 lost mile RD , w.newfield, ME 04095" "20 mile stream rd , proctorsville, VT 05153"
</code></pre><p>Note it’s still possible for a valid street address to contain <code>mile</code> as a word (my regular expression only matches <code>mile</code> as a whole word, not as part of a word), but such addresses should be very rare and are difficult to separate from the milepost usage. So I’ll just ignore all of them.</p>
<figure class="highlight r"><table><tr><td class="code"><pre><div class="line">address[str_detect(paste0(NUM_MILE, STREETNAME), <span class="string">"\\bmile\\b|\\bmiles\\b"</span>), address_type := <span class="string">'m'</span>]</div><div class="line">address[str_detect(address$input_address, <span class="string">"(?i)milepost"</span>), address_type := <span class="string">'m'</span>]</div></pre></td></tr></table></figure>
<p>Another special address format is the grid-style address. I decided to remove the grid number part and keep the rest of the address. The Geocoder will get a rough location for that street or city, which is still helpful for my purpose, and the Geocoding match score will separate this kind of rough match from the exact matches of street addresses.</p>
<blockquote>
<p>Grid-style Complete Address Numbers (Example: “N89W16758”). In certain communities in and around southern Wisconsin, Complete Address Numbers include a map grid cell reference preceding the Address Number. In the examples above, “N89W16758” should be read as “North 89, West 167, Address Number 58”. “W63N645” should be read as “West 63, North, Address Number 645.” The north and west values specify a locally-defined map grid cell with which the address is located. Local knowledge is needed to know when the grid reference stops and the Address Number begins.<br>Page 37, <a href="https://www.fgdc.gov/standards/projects/FGDC-standards-projects/street-address/index_html" target="_blank" rel="external">United States Thoroughfare, Landmark, and Postal Address Data Standard</a></p>
</blockquote>
<p>Most are WI and MN addresses. The exception is the <code>E003</code> NY address; I’m not sure what that means. Since the Geocoder cannot handle these prefixes either, they can be removed.</p>
<pre><code>> sample(address[str_detect(address$input_address, "^[NSWEnswe]\\d"), input_address], 10)
[1] "W26820 Shelly Lynn DR , Pewaukee, WI 53072" "E14 GATE , St. Paul, MN 55111"
[3] "W5336 Fairview ROAD , Monticello, WI 53570" "W22870 Marjean LA , Pewaukee, WI 53072"
[5] "E003 , New York City, NY 10011" "W15085 Appleton AVE , Menomonee Falls, WI 53051"
[7] "N7324 Lake Knutson RD , Iola, WI 54945" "N10729 Hwy 17 S. , Rhinelander, WI 54501"
[9] "N2494 St. Hwy. 162 , La Crosse, WI 54601" "N2639 Cty Hwy Z , Palmyra, WI 53156"
</code></pre><figure class="highlight r"><table><tr><td class="code"><pre><div class="line">address[str_detect(address$input_address, <span class="string">"^[NSWEnswe]\\d"</span>) & address_type == <span class="string">'a'</span>, </div><div class="line"> address_type := <span class="string">'ag'</span>]</div><div class="line">address[address_type == <span class="string">'ag'</span>, </div><div class="line"> input_address := str_replace(input_address, <span class="string">"^[NSWEnswe]\\d\\w*\\s"</span>, <span class="string">""</span>)]</div></pre></td></tr></table></figure>
<p>Some addresses have double quotes in them. Paired double quotes can be handled by the csv format and the Geocoder, but an unpaired double quote will cause problems for the csv file.</p>
<pre><code>> sample(address[str_detect(input_address, '"'), input_address], 10)
[1] "317 IND \"C\" line at 14th ST , New York City, NY 10011" "750 W \"D\" AVE , Kingman, KS 67068"
[3] "HWY \"32\" , SHEBOYGAN, WI 53083" "22796 \"H\" DR N , Marshall, MI 49068"
[5] "5745 CR 631 \"C\" ST , Bushnell, FL 33513" "CTY \"MM\" , HOWARDS GROVE, WI 53083"
[7] "\"BB\" HWY , West Plains, MO 65775" "I-55 (MAIN TO HWY \"M\") , Imperial, MO 63052"
[9] "3400 Wy\"East RD , Hood River, OR 97031" "6555 Hwy \"D\" , parma, MO 63870"
</code></pre><figure class="highlight r"><table><tr><td class="code"><pre><div class="line"><span class="comment"># remove single double quote</span></div><div class="line">address[str_detect(input_address, <span class="string">'(?m)(^[^"]*)"([^"]*$)'</span>) & </div><div class="line"> str_detect(address_type, <span class="string">"^a"</span>), address_type := <span class="string">'aq'</span>]</div><div class="line">address[address_type == <span class="string">'aq'</span>, </div><div class="line"> input_address := str_replace_all(input_address, <span class="string">'(?m)(^[^"]*)"([^"]*$)'</span>, <span class="string">"\\1\\2"</span>)]</div></pre></td></tr></table></figure>
<p>Some addresses used (), which causes problems for the Geocoder. The content inside the () can be removed.</p>
<pre><code>> sample(address[str_detect(address$input_address, "\\(.*\\)"), input_address], 10)
[1] "hwy 56 (side of beersheba mt) , beersheba springs, TN 37305"
[2] "805 PARKWAY (DOWNTOWN) RD , Gatlinburg, TN 37738"
[3] "3409 JAMESWAY DR SW , Bernalillo (County), NM 87105"
[4] "96 Arroyo Hondo Road , Santa Fe (County), NM 87508"
[5] "3555 Dobbins Bridge RD , Anderson (County), SC 29625"
[6] "KARPER (12100-14999) RD , MERCERSBURG, PA 17236"
[7] "15.5 I-81 (10001-16000) LN N , Chambersburg, PA 17201"
[8] "30 Wintergreen DR , Beaufort (County), SC 29906"
[9] "305 Rosecrest RD , Spartanburg (County), SC 29303"
[10] "1678 ROUTE 12 (Gales Ferry) , Gales Ferry, CT 06335"
</code></pre><figure class="highlight r"><table><tr><td class="code"><pre><div class="line"><span class="comment"># remove paired ()</span></div><div class="line">address[str_detect(address$input_address, <span class="string">"\\(.*\\)"</span>), address_type := <span class="string">'a()'</span>]</div><div class="line">address[address_type == <span class="string">'a()'</span>, </div><div class="line"> input_address := str_replace_all(input_address, <span class="string">"\\(.*\\)"</span>, <span class="string">""</span>)]</div></pre></td></tr></table></figure>
<p>After this step, there are still some cases with a single unmatched (.</p>
<pre><code>> sample(address[str_detect(input_address, "\\("), input_address], 10)
[1] "65 E Interstate 26 HWY , Columbus (Township o, NC 28722"
[2] "4496 SYCAMORE GROVE (4300-4799 RD , Chambersburg, PA 17201"
[3] "AAA RD , Fort Hood (U.S. Army, TX 76544"
[4] "2010 Catherine Lake RD , Richlands (Township, NC 28574"
[5] "285 Scott CIR NW , Calhoun (St. Address, GA 30701"
[6] "Highway 411 NE , Calhoun (St. Address, GA 30701"
[7] "2626 HILLTOP CT SW , Littlerock (RR name, WA 98556"
[8] "144 Tyler Ct. , Richland (Township o, PA 15904"
[9] "263 Farmington AVE , Farmington (Health C, CT 06030"
[10] "12957 Roberts RD , Hartford (Township o, OH 43013"
</code></pre><figure class="highlight r"><table><tr><td class="code"><pre><div class="line">address[str_detect(input_address, <span class="string">"\\("</span>), address_type := <span class="string">'a('</span>]</div><div class="line"><span class="comment"># other than the case that ( in beginning, all content from ( to , to be removed.</span></div><div class="line">address[str_detect(input_address, <span class="string">"^\\("</span>), </div><div class="line"> input_address := str_replace(input_address, <span class="string">"^\\("</span>, <span class="string">""</span>)]</div><div class="line">address[str_detect(input_address, <span class="string">"\\("</span>), </div><div class="line"> input_address := str_replace(input_address, <span class="string">"\\(.*(,)"</span>, <span class="string">"\\1"</span>)]</div></pre></td></tr></table></figure>
<p>Some used ; to add additional information, which only causes trouble for the Geocoder.</p>
<pre><code>> sample(address[str_detect(input_address, ";"), input_address], 10)
[1] "1816 MT WASHINGTON AV #1; WHIT , Colorado Springs, CO 80906"
[2] "3201 E PLATTE AV; WAL-MART STO , Colorado Springs, CO 00000"
[3] "1511 YUMA ST #2; CONOVER APART , Colorado Springs, CO 80909"
[4] "3550 AFTERNOON CR; MSGT ROY P , Colorado Springs, CO 80910"
[5] "805 S CIRCLE DR #B2; APOLLO PA , Colorado Springs, CO 00000"
[6] "5590 POWERS CENTER PT; SEVEN E , Colorado Springs, CO 80920"
[7] "715 CHEYENNE MEADOWS RD; DIAMO , Colorado Springs, CO 80906"
[8] "3140 VAN TEYLINGEN DR #A; SIER , Colorado Springs, CO 00000"
[9] "Meadow Rd; rifle clu , Hampden, OO 04444"
[10] "3301 E SKELLY DR;J , TULSA, OK 74105"
</code></pre><figure class="highlight r"><table><tr><td class="code"><pre><div class="line">address[str_detect(input_address, <span class="string">";"</span>), address_type := <span class="string">'a;'</span>]</div><div class="line">address[address_type == <span class="string">'a;'</span>, </div><div class="line"> input_address := str_replace(input_address, <span class="string">";.*?(,)"</span>, <span class="string">"\\1"</span>)]</div></pre></td></tr></table></figure>
<p>Some have *.</p>
<pre><code>> sample(address[str_detect(address$input_address, "\\*") & address_type == 'a', input_address], 10)
[1] "TAYLOR ST , *Holyoke, MA 01040" "NORTHAMPTON ST , *Holyoke, MA 01040"
[3] "1*5* W Coral RD , Stanton, MI 48888" "Cr 727 *26 , angleton, TX 77515"
[5] "378 APPLETON ST , *Holyoke, MA 01040" "0 I195*I895 , ARBUTUS, MD 21227"
[7] "1504 NORTHAMPTON ST , *Holyoke, MA 01040" "50 RIVER TER , *Holyoke, MA 01040"
[9] "BOOKER ST * CARVER ST , Palatka, FL 32177" "19 OCONNOR AVE , *HOLYOKE, MA 01040"
</code></pre><figure class="highlight r"><table><tr><td class="code"><pre><div class="line">address[str_detect(address$input_address, <span class="string">"\\*"</span>) & address_type == <span class="string">'a'</span>, address_type := <span class="string">'a*'</span>]</div><div class="line">address[address_type == <span class="string">'a*'</span>, input_address := str_replace_all(input_address, <span class="string">"\\*"</span>, <span class="string">""</span>)]</div></pre></td></tr></table></figure>
<p>This looks like came from some program output.</p>
<pre><code>> head(address[str_detect(address_type, "^a") & str_detect(input_address, "\\*"), input_address])
[1] "5280 Bruns RD , **UNDEFINED, CA 00000" "6500 Lindeman RD , **UNDEFINED, CA 00000"
[3] "5280 Bruns RD , **UNDEFINED, CA 00000" "17501 Sr 4 , **UNDEFINED, CA 00000"
[5] "5993 Bethel Island RD , **UNDEFINED, CA 00000" "1 Quail Hill LN , **UNDEFINED, CA 00000"
</code></pre><figure class="highlight r"><table><tr><td class="code"><pre><div class="line">address[str_detect(address_type, <span class="string">"^a"</span>) & str_detect(input_address, <span class="string">"\\*"</span>),</div><div class="line"> input_address := str_replace(input_address, <span class="string">"\\*\\*UNDEFINED"</span>, <span class="string">""</span>)]</div></pre></td></tr></table></figure>
<p>Almost any special character that OK for human reading still cannot be handled by the Geocoder.</p>
<pre><code>> sample(address[str_detect(input_address, "^#"), input_address], 10)
[1] "# 6 HIGH , Marks, MS 38646" "#560 CR56 , MAPLECREST, NY 12454"
[3] "#250blk Durgintown rd. , Hiram, ME 04041" "#888 Durgintown Rd. , Hiram, ME 04041"
[5] "#15 LITTLE KANAWHA RIVER RD , PARKERSBURG, WV 26101" "# 12 HOLLOW RD , WELLSTON, OH 45692"
[7] "#10 I-24 , Paducah, KY 42003" "#10.5 mm St RD 264 , Yahtahey, NM 87375"
[9] "#1 CANAL RD , SENECA, IL 61360" "#08 N Ola DR , Yahtahey, NM 87375"
</code></pre><figure class="highlight r"><table><tr><td class="code"><pre><div class="line">address[str_detect(input_address, <span class="string">"^#"</span>), address_type := <span class="string">'a#'</span>]</div><div class="line">address[address_type == <span class="string">'a#'</span>, input_address := str_replace_all(input_address, <span class="string">"^#"</span>, <span class="string">""</span>)]</div></pre></td></tr></table></figure>
<p>All these steps may look cumbersome. Actually I just check the Geocoding results on one year data raw input, find all the problems and errors, clean them by types. Then I apply same cleaning code to other years because they are very similar, and I got the Geocoding performance doubled! I think this cleaning is well worth the effort.</p>
<h2 id="version-history"><a href="#Version-History" class="headerlink" title="Version History"></a>Version History</h2><ul>
<li>2016-02-03 : First version.</li>
<li>2016-05-11 : Added Summary.</li>
</ul>
]]></content>
<summary type="html">
<h2 id="Summary"><a href="#Summary" class="headerlink" title="Summary"></a>Summary</h2><ul>
<li>This is my second post on the topic of Data Cleaning.</li>
<li>Cleaning address formats turned out to have a substantial positive impact on Geocoding performance.</li>
<li>A deep understanding of the address format standard is needed to deal with all kinds of special cases.</li>
</ul>
</summary>
<category term="R" scheme="https://dracodoc.github.io/categories/R/"/>
<category term="R" scheme="https://dracodoc.github.io/tags/R/"/>
<category term="Geocoding" scheme="https://dracodoc.github.io/tags/Geocoding/"/>
<category term="Data Science" scheme="https://dracodoc.github.io/tags/Data-Science/"/>
<category term="Data Cleaning" scheme="https://dracodoc.github.io/tags/Data-Cleaning/"/>
</entry>
<entry>
<title>Data Cleaning Part 1 - NYC Taxi Trip Data, Looking For Stories Behind Errors</title>
<link href="https://dracodoc.github.io/2016/01/31/data-cleaning/"/>
<id>https://dracodoc.github.io/2016/01/31/data-cleaning/</id>
<published>2016-02-01T01:27:06.000Z</published>
<updated>2016-08-19T13:47:21.585Z</updated>
<content type="html"><![CDATA[<h2 id="summary"><a href="#Summary" class="headerlink" title="Summary"></a>Summary</h2><ul>
<li>Data cleaning is a cumbersome but important task for real-world Data Science projects.</li>
<li>This is a discussion of my data cleaning practice on the NYC Taxi Trip data.</li>
<li>Lots of domain knowledge, common sense, and business thinking are involved.</li>
</ul>
<a id="more"></a>
<h2 id="data-cleaning-the-unavoidable-time-consuming-cumbersome-nontrivial-task"><a href="#Data-Cleaning-the-unavoidable-time-consuming-cumbersome-nontrivial-task" class="headerlink" title="Data Cleaning, the unavoidable, time consuming, cumbersome nontrivial task"></a>Data Cleaning, the unavoidable, time consuming, cumbersome nontrivial task</h2><p>Data Science may sound fancy, but I saw many posts/blogs of data scientists complaining that much of their time were spending on data cleaning. From my own experience on several learning/volunteer projects, this step do require lots of time and much attention to details. However I often felt the abnormal or wrong data are actually more interesting. There must be some explanations behind the error, and that could be some interesting stories. Every time after I filtered some data with errors, I can have better understanding of the whole picture and estimate of the information content of the data set.</p>
<h3 id="nyc-taxi-trip-data"><a href="#NYC-Taxi-Trip-Data" class="headerlink" title="NYC Taxi Trip Data"></a>NYC Taxi Trip Data</h3><p>One good example is the <a href="http://chriswhong.com/open-data/foil_nyc_taxi/" target="_blank" rel="external">the NYC Taxi Trip Data</a>. </p>
<p><em>By the way, <a href="http://toddwschneider.com/posts/analyzing-1-1-billion-nyc-taxi-and-uber-trips-with-a-vengeance/" target="_blank" rel="external">this analysis and exploration</a> is pretty impressive. I think that’s partly because the author is a NYC native and already had lots of possible patterns in mind. For the same reason, I like to explore my local area in any national data set to gain more understanding from the data. Besides, it turns out you don’t even need a base map layer for the taxi pickup point map when you have enough data points: the pickup points themselves trace out all the streets and roads!</em></p>
<p>First I prepared and merged the two data files, trip data and trip fare.</p>
<figure class="highlight r"><table><tr><td class="code"><pre><div class="line"></div><div class="line"><span class="keyword">library</span>(data.table)</div><div class="line"><span class="keyword">library</span>(stringr)</div><div class="line"><span class="keyword">library</span>(lubridate)</div><div class="line"><span class="keyword">library</span>(geosphere)</div><div class="line"><span class="keyword">library</span>(ggplot2)</div><div class="line"><span class="keyword">library</span>(ggmap)</div><div class="line"><span class="comment">## ------------------ read and check data ---------------------------------</span></div><div class="line">trip.data = fread(<span class="string">"trip_data_3.csv"</span>, sep = <span class="string">','</span>, header = <span class="literal">TRUE</span>, showProgress = <span class="literal">TRUE</span>)</div><div class="line">trip.fare = fread(<span class="string">"trip_fare_3.csv"</span>, sep = <span class="string">','</span>, header = <span class="literal">TRUE</span>, showProgress = <span class="literal">TRUE</span>)</div><div class="line">summary(trip.data)</div><div class="line">summary(trip.fare)</div><div class="line"><span class="comment">## ------------------ column format -----------------------</span></div><div class="line"><span class="comment">## all conversion were done on new copy first to make sure it was done right, </span></div><div class="line"><span class="comment">## then the original columns were overwrite in place to save memory</span></div><div class="line"><span class="comment"># remove leading space in column names from fread</span></div><div class="line">setnames(trip.data, str_trim(colnames(trip.data)))</div><div class="line">setnames(trip.fare, str_trim(colnames(trip.fare)))</div><div class="line"><span class="comment"># convert characters to factor to verify missing values, easier to observe </span></div><div class="line">trip.data[, medallion := as.factor(medallion)]</div><div class="line">trip.data[, hack_license := as.factor(hack_license)]</div><div class="line">trip.data[, vendor_id := as.factor(vendor_id)]</div><div class="line">trip.data[, store_and_fwd_flag := as.factor(store_and_fwd_flag)]</div><div class="line">trip.fare[, medallion := as.factor(medallion)]</div><div class="line">trip.fare[, hack_license := as.factor(hack_license)]</div><div class="line">trip.fare[, vendor_id := as.factor(vendor_id)]</div><div class="line">trip.fare[, payment_type := as.factor(payment_type)]</div><div class="line"><span class="comment"># date time conversion. 
</span></div><div class="line">trip.data[, pickup_datetime := fast_strptime(pickup_datetime,<span class="string">"%Y-%m-%d %H:%M:%S"</span>)]</div><div class="line">trip.data[, dropoff_datetime := fast_strptime(dropoff_datetime,<span class="string">"%Y-%m-%d %H:%M:%S"</span>)]</div><div class="line">trip.fare[, pickup_datetime := fast_strptime(pickup_datetime,<span class="string">"%Y-%m-%d %H:%M:%S"</span>)]</div><div class="line"><span class="comment">## ------------- join two data set by pickup_datetime, medallion, hack_license -------------</span></div><div class="line"><span class="comment"># after join by 3 columns, all vendor_id also matches: </span></div><div class="line"><span class="comment"># trip.all[vendor_id.x == vendor_id.y, .N] so add vendor_id to key too.</span></div><div class="line">setkey(trip.data, pickup_datetime, medallion, hack_license, vendor_id)</div><div class="line">setkey(trip.fare, pickup_datetime, medallion, hack_license, vendor_id)</div><div class="line"><span class="comment"># we can add transaction number to trip and fare so we can identify missed match more easily</span></div><div class="line">trip.data[, trip_no := .I]</div><div class="line">trip.fare[, fare_no := .I]</div><div class="line">trip.all = merge(trip.data, trip.fare, all = <span class="literal">TRUE</span>, suffixes = c(<span class="string">".x"</span>, <span class="string">".y"</span>))</div></pre></td></tr></table></figure>
<p>Then I found many obvious data errors.</p>
<h4 id="some-columns-have-obvious-wrong-values-like-zero-passenger-count"><a href="#Some-columns-have-obvious-wrong-values-like-zero-passenger-count" class="headerlink" title="Some columns have obvious wrong values, like zero passenger count."></a>Some columns have obvious wrong values, like zero passenger count.</h4><p><img src="zero_passenger.png" alt="zero passenger count"></p>
<p>The other columns look perfectly normal, though. As long as you are not using the passenger count information, I think these rows are still valid.</p>
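<p>A quick count gives a sense of how common this is (a one-line sketch, assuming the column is named <code>passenger_count</code> as in the raw data):</p>
<pre><code># how many trips report zero passengers
trip.all[passenger_count == 0, .N]
</code></pre>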
<h4 id="another-interesting-phenomenon-is-the-super-short-trip"><a href="#Another-interesting-phenomenon-is-the-super-short-trip" class="headerlink" title="Another interesting phenomenon is the super short trip:"></a>Another interesting phenomenon is the super short trip:</h4><figure class="highlight r"><table><tr><td class="code"><pre><div class="line">short = trip.all[trip_time_in_secs <<span class="number">10</span>][order(total_amount)] </div><div class="line">View(short)</div></pre></td></tr></table></figure>
<p><img src="short_trip.png" alt="short trip"></p>
<ul>
<li><p>One possible explanation I can imagine is that some passengers got in a taxi and then got off immediately, so the time and distance are near zero and they paid the minimum fare of $2.50. Many rows do have zero for the pickup or drop off location, or almost the same location for pick up and drop off.</p>
</li>
<li><p>Then how is the longer trip distance possible, especially when most pick up and drop off coordinates are either zero or the same location? Even if the taxi was stuck in traffic, so that the taximeter recorded no location change and no trip distance, a trip time of less than 10 seconds still cannot be explained.</p>
</li>
</ul>
<p><img src="long_distance_in_short_time.png" alt="long distance in short time"></p>
<ul>
<li>There are also quite a few large trip fares for very short trips. Most of them have pick up and drop off coordinates at zero or at the same locations.</li>
</ul>
<p><img src="fare_hist.png" alt="fare amount"></p>
<p><img src="big_fare_in_short_trip.png" alt="big fare in short trip"></p>
<p>I don’t have good explanations for these phenomena, and I don’t want to make too many assumptions since I’m not really familiar with NYC taxi trips. I guess a NYC local could probably offer some insights, and we could then verify them with the data.</p>
<h4 id="average-driving-speed"><a href="#Average-driving-speed" class="headerlink" title="Average driving speed"></a>Average driving speed</h4><p>We can further verify the trip time/distance combination by checking the average driving speed. The near zero time or distance could cause too much variance in calculated driving speed. Considering the possible input error in time and distance, we can round up the time in seconds to minutes before calculating driving speed.</p>
<p>First, check the records that have a very short trip time but a nontrivial trip distance:</p>
<figure class="highlight r"><table><tr><td class="code"><pre><div class="line"></div><div class="line">distance.conflict = trip.all[trip_time_in_secs < <span class="number">10</span> & trip_distance > <span class="number">0.5</span>][order(trip_distance)]</div></pre></td></tr></table></figure>
<p>If the pick up and drop off coordinates are not empty, we can calculate the great-circle distance between them. The actual trip distance must be equal to or greater than this distance.</p>
<figure class="highlight r"><table><tr><td class="code"><pre><div class="line"></div><div class="line">distance.conflict.with.gps = distance.conflict[pickup_longitude != <span class="number">0</span> & </div><div class="line"> pickup_latitude != <span class="number">0</span> & </div><div class="line"> dropoff_longitude != <span class="number">0</span> & </div><div class="line"> dropoff_latitude != <span class="number">0</span>]</div><div class="line">gps.mat = as.matrix(distance.conflict.with.gps[, </div><div class="line"> .(pickup_longitude, pickup_latitude, dropoff_longitude, dropoff_latitude)])</div><div class="line">distance.conflict.with.gps[, dis.by.gps.meter := distHaversine(gps.mat[, <span class="number">1</span>:<span class="number">2</span>],gps.mat[, <span class="number">3</span>:<span class="number">4</span>])][order(dis.by.gps.meter)]</div><div class="line">distance.conflict.with.gps[, dis.by.gps.mile := dis.by.gps.meter * <span class="number">0.000621371</span>]</div></pre></td></tr></table></figure>
<p>If both the great-circle distance and the trip distance are nontrivial, it’s more likely that the sub-10-second trip times are wrong.</p>
<figure class="highlight r"><table><tr><td class="code"><pre><div class="line"></div><div class="line">wrong.time = distance.conflict.with.gps[dis.by.gps.mile >= <span class="number">0.5</span>]</div><div class="line">View(wrong.time[, .(trip_time_in_secs, trip_distance, fare_amount, dis.by.gps.mile)])</div></pre></td></tr></table></figure>
<p><img src="dis_by_gps.png" alt="distance by gps"></p>
<p>And there must be something wrong if the great-circle distance is much bigger than the trip distance. Note the data here is limited to the short-trip-time subset, but this type of error can happen in any record.</p>
<p><img src="more_great_circle_distance.png" alt="more great circle distance"></p>
<p>Either the taximeter had errors in reporting the trip distance, or the gps coordinates were wrong. Because all these trip times are very short, I think the problem more likely lies with the gps coordinates; time and distance measurement should be much simpler and more reliable than gps coordinate measurement.</p>
<h4 id="gps-coordinates-distribution"><a href="#gps-coordinates-distribution" class="headerlink" title="gps coordinates distribution"></a>gps coordinates distribution</h4><p>We can further check the accuracy of the gps coordinates by matching with NYC boundary. The code below is a simplified method which take center of NYC area then add 100 miles in four directions as the boundary. More sophisticated way is to use a shapefile, but it will be much slower in checking data points. Since the taxi trip actually can have at least one end outside of NYC area, I don’t think we need to be too strict on NYC area boundary.</p>
<figure class="highlight r"><table><tr><td class="code"><pre><div class="line"></div><div class="line">trip.valid.gps = trip.all[pickup_longitude != <span class="number">0</span> & pickup_latitude != <span class="number">0</span> & </div><div class="line"> dropoff_longitude != <span class="number">0</span> & dropoff_latitude != <span class="number">0</span>]</div><div class="line">nyc.lat = <span class="number">40.719681</span> <span class="comment"># picked "center of NYC area" from google map</span></div><div class="line">nyc.lon = -<span class="number">74.00536</span></div><div class="line">nyc.lat.max = nyc.lat + <span class="number">100</span>/<span class="number">69</span></div><div class="line">nyc.lat.min = nyc.lat - <span class="number">100</span>/<span class="number">69</span></div><div class="line">nyc.lon.max = nyc.lon + <span class="number">100</span>/<span class="number">52</span></div><div class="line">nyc.lon.min = nyc.lon - <span class="number">100</span>/<span class="number">52</span></div><div class="line">trip.valid.gps.nyc = trip.valid.gps[nyc.lon.max > pickup_longitude & pickup_longitude > nyc.lon.min &</div><div class="line"> nyc.lon.max > dropoff_longitude & dropoff_longitude > nyc.lon.min &</div><div class="line"> nyc.lat.max > pickup_latitude & pickup_latitude > nyc.lat.min &</div><div class="line"> nyc.lat.max > dropoff_latitude & dropoff_latitude > nyc.lat.min]</div><div class="line">View(trip.valid.gps[!trip.valid.gps.nyc][order(trip_distance)])</div><div class="line">mat.nyc = as.matrix(trip.valid.gps.nyc[, .(pickup_longitude, pickup_latitude, dropoff_longitude, dropoff_latitude)])</div><div class="line">dis = distHaversine(mat.nyc[, <span class="number">1</span>:<span class="number">2</span>],mat.nyc[, <span class="number">3</span>:<span class="number">4</span>]) / <span class="number">1639.344</span></div><div class="line">trip.valid.gps.nyc[, dis.by.gps := dis]</div></pre></td></tr></table></figure>
<p><img src="off_gps_coordinates.png" alt="off gps coordinates"></p>
<p>I found another way to verify the gps coordinates when I was checking the trips that started from JFK airport. Note I used two reference points in JFK airport to better capture the trips that originated from inside the airport and the immediate neighborhood of the JFK exit.</p>
<figure class="highlight r"><table><tr><td class="code"><pre><div class="line"></div><div class="line"><span class="comment"># the official loc of JFK is too far on east, we choose 2 point to better represent possible pickup areas.</span></div><div class="line">jfk.inside = data.frame(lon = -<span class="number">73.783074</span>, lat = <span class="number">40.64561</span>)</div><div class="line">jfk.exit = data.frame(lon = -<span class="number">73.798523</span>, lat = <span class="number">40.658439</span>)</div><div class="line">jfk.map = get_map(location = unlist(jfk.inside), zoom = <span class="number">13</span>, maptype = <span class="string">'roadmap'</span>)</div><div class="line"><span class="comment"># rides from JFK could end at out of NYC, but there are too many obvious wrong gps information in that part of data, we will just use the data that have gps location in NYC area this time. This area is actually rather big, a square area with 200 miles edge.</span></div><div class="line">trip.valid.gps.nyc[, dis.jfk.center.meter := distHaversine(mat.nyc[, <span class="number">1</span>:<span class="number">2</span>], jfk.inside)]</div><div class="line">trip.valid.gps.nyc[, dis.jfk.exit.meter := distHaversine(mat.nyc[, <span class="number">1</span>:<span class="number">2</span>], jfk.exit)]</div><div class="line"><span class="comment"># the actual distance threshold is adjusted by visual checking the map below, so that it includes most rides picked up from JFK, and excludes rides in neighborhood but not from JFK.</span></div><div class="line">near.jfk = trip.valid.gps.nyc[dis.jfk.center.meter < <span class="number">2500</span> | dis.jfk.exit.meter < <span class="number">1200</span>]</div><div class="line">ggmap(jfk.map) +geom_point(data = rbind(jfk.inside, jfk.exit), aes(x = lon, y = lat)) + geom_point(data = near.jfk, aes(x = pickup_longitude, y = pickup_latitude, colour = <span class="string">'red'</span>))</div><div class="line"></div></pre></td></tr></table></figure>
<p><img src="JFK_trip.png" alt="JFK trip"></p>
<p>Interestingly, there are some pick up points on the airplane runway or in the bay. These are obvious errors; actually, I think gps coordinates reported in a big city can have all kinds of errors.</p>
<h4 id="superman-taxi-driver"><a href="#Superman-taxi-driver" class="headerlink" title="Superman taxi driver"></a>Superman taxi driver</h4><p>I also found some interesting records in checking taxi driver revenue.</p>
<figure class="highlight r"><table><tr><td class="code"><pre><div class="line"></div><div class="line">trip.march = trip.all[month(dropoff_datetime) == <span class="number">3</span>]</div><div class="line">revenue = trip.march[, .(revenue.march = sum(total_amount)), by = hack_license] </div><div class="line">summary(revenue$revenue.march)</div></pre></td></tr></table></figure>
<figure class="highlight plain"><table><tr><td class="code"><pre><div class="line">Min. 1st Qu. Median Mean 3rd Qu. Max. </div><div class="line"> 2.6 4955.0 7220.0 6871.0 9032.0 43770.0</div></pre></td></tr></table></figure>
<p>Who are these superman taxi drivers that earned significantly more?</p>
<figure class="highlight r"><table><tr><td class="code"><pre><div class="line"></div><div class="line">tail(revenue[order(revenue.march)])</div></pre></td></tr></table></figure>
<figure class="highlight plain"><table><tr><td class="code"><pre><div class="line"> hack_license revenue.march</div><div class="line">1: 3AAB94CA53FE93A64811F65690654649 21437.62</div><div class="line">2: 74CC809D28AE726DDB32249C044DA4F8 22113.14</div><div class="line">3: F153D0336BF48F93EC3913548164DDBD 22744.56</div><div class="line">4: D85749E8852FCC66A990E40605607B2F 23171.50</div><div class="line">5: 847349F8845A667D9AC7CDEDD1C873CB 23366.48</div><div class="line">6: CFCD208495D565EF66E7DFF9F98764DA 43771.85</div></pre></td></tr></table></figure>
<figure class="highlight r"><table><tr><td class="code"><pre><div class="line"></div><div class="line">View(trip.all[hack_license == <span class="string">'CFCD208495D565EF66E7DFF9F98764DA'</span>])</div></pre></td></tr></table></figure>
<p><img src="superman_driver.png" alt="superman driver"></p>
<p>So this driver was using different medallions with the same hack license and picked up 1412 rides in March; some rides even started before the previous one ended (No. 17, 18, 22, etc.). The simplest explanation is that these records are not from one single driver.</p>
<figure class="highlight r"><table><tr><td class="code"><pre><div class="line"></div><div class="line">rides = trip.march[, .N, by = hack_license]</div><div class="line">summary(rides)</div><div class="line">tail(rides[order(N)]) </div></pre></td></tr></table></figure>
<figure class="highlight plain"><table><tr><td class="code"><pre><div class="line"> hack_license N</div><div class="line">1: 74CC809D28AE726DDB32249C044DA4F8 1514</div><div class="line">2: 51C1BE97280A80EBFA8DAD34E1956CF6 1530</div><div class="line">3: 5C19018ED8557E5400F191D531411D89 1575</div><div class="line">4: 847349F8845A667D9AC7CDEDD1C873CB 1602</div><div class="line">5: F49FD0D84449AE7F72F3BC492CD6C754 1638</div><div class="line">6: D85749E8852FCC66A990E40605607B2F 1649</div></pre></td></tr></table></figure>
<p>These hack license owners each picked up more than 1500 rides in March; that’s about 50 per day.</p>
<p>We could further check whether there is any time overlap between a drop off and the next pickup, or whether a pick up location is too far from the last drop off location, but I think there is no need to do that before I have a better theory.</p>
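<p>For reference, the overlap check could look like this sketch: order each driver’s trips by pickup time, then compare every pickup with the same driver’s previous drop off.</p>
<pre><code># flag trips that start before the same driver's previous trip ended
setkey(trip.march, hack_license, pickup_datetime)
trip.march[, prev.dropoff := shift(dropoff_datetime), by = hack_license]
overlap = trip.march[pickup_datetime < prev.dropoff]
</code></pre>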
<h2 id="summary"><a href="#Summary-1" class="headerlink" title="Summary"></a>Summary</h2><p>In this case I didn’t dig too much yet because I’m not really familiar with NYC taxi, but there are lots of interesting phenomenons already. We can know a lot about the quality of certain data fields from these errors.</p>
<p>In my other project, data cleaning was not just about digging up interesting stories; it actually helped the data processing a lot. See more details in my <a href="http://dracodoc.github.io/2016/02/03/data-cleaning-geocode/">next post</a>.</p>
<h2 id="version-history"><a href="#Version-History" class="headerlink" title="Version History"></a>Version History</h2><ul>
<li>2016-01-31 : First version.</li>
<li>2016-05-11 : Added Summary.</li>
</ul>
]]></content>
<summary type="html">
<h2 id="Summary"><a href="#Summary" class="headerlink" title="Summary"></a>Summary</h2><ul>
<li>Data cleaning is a cumbersome but important task for real-world Data Science projects.</li>
<li>This is a discussion of my data cleaning practice on the NYC Taxi Trip data.</li>
<li>Lots of domain knowledge, common sense, and business thinking are involved.</li>
</ul>
</summary>
<category term="R" scheme="https://dracodoc.github.io/categories/R/"/>
<category term="R" scheme="https://dracodoc.github.io/tags/R/"/>
<category term="Data Science" scheme="https://dracodoc.github.io/tags/Data-Science/"/>
<category term="Data Cleaning" scheme="https://dracodoc.github.io/tags/Data-Cleaning/"/>
<category term="NYC taxi data" scheme="https://dracodoc.github.io/tags/NYC-taxi-data/"/>
</entry>
<entry>
<title>Script And Workflow For Batch Geocoding Millions Of Address With PostGIS Tiger Geocoder</title>
<link href="https://dracodoc.github.io/2015/11/19/Script-workflow/"/>
<id>https://dracodoc.github.io/2015/11/19/Script-workflow/</id>
<published>2015-11-19T20:05:00.000Z</published>
<updated>2016-08-19T19:39:10.803Z</updated>
<content type="html"><![CDATA[<h2 id="summary"><a href="#Summary" class="headerlink" title="Summary"></a>Summary</h2><ul>
<li>I discussed all the problems I met, the approaches I tried, and the improvements I achieved in the Geocoding task.</li>
<li>There are many subtle details, some open questions, and areas that can be improved.</li>
<li>The final working script and complete workflow are hosted on <a href="https://github.com/dracodoc/Geocode" target="_blank" rel="external">github</a>.</li>
</ul>
<a id="more"></a>
<h2 id="introduction"><a href="#Introduction" class="headerlink" title="Introduction"></a>Introduction</h2><p>This is the detailed discussion of my script and workflow for geocoding NFIRS data. See <a href="http://dracodoc.github.io/2015/11/11/Red-Cross-Smoke-Alarm-Project/">background of project</a> and <a href="http://dracodoc.github.io/2015/11/17/Geocoding/">the system setup</a> in my previous posts.</p>
<p>So I have 18 million addresses like this; how can I geocode them into valid addresses and coordinates, and map them to census blocks?<br><img src="NFIRS_data_sample.png" alt="NFIRS data sample"></p>
<h2 id="tiger-geocoder-geocode-function"><a href="#Tiger-Geocoder-Geocode-Function" class="headerlink" title="Tiger Geocoder Geocode Function"></a>Tiger Geocoder Geocode Function</h2><p>Tiger Geocoder extension have this <a href="http://postgis.net/docs/Geocode.html" target="_blank" rel="external"><code>geocode</code> function</a> to take in address string then output a set of possible locations and coordinates. A perfect formated accurate address could have an exact match in 61ms, but if there are misspelling or other non-perfect input, it could take much longer time.</p>
<p>Since geocoding performance varies a lot from case to case and I have 18 million addresses to geocode, I needed to take every possible measure to improve performance and finish the task in fewer hours. I searched numerous discussions about improving performance and tried most of the suggestions.</p>
<h2 id="preparing-addresses"><a href="#Preparing-Addresses" class="headerlink" title="Preparing Addresses"></a>Preparing Addresses</h2><p>First I need to prepare my address input. Technically NFIRS data have a column of <code>Location Type</code> to separate street addresses, intersections and other type of input. I filtered the addresses with the street address type then further removed many rows that obviously are still intersections.</p>
<p>NFIRS designed many columns for the different parts of an address, like street prefix, suffix, apt number, etc. I concatenated them into a string formatted to meet the <code>geocode</code> function’s expectations. <strong>A good format with proper comma separation makes the geocode function’s work much easier</strong>. One bonus of concatenating the address segments is that some misplaced input columns get corrected; for example, some rows have the street number in the street name column.</p>
<p>There are still numerous input errors, but I didn’t plan to clean up too much at first, because I don’t know what will cause problems before actually running the geocoding process. It is probably easier to run one pass over one year’s data first, then collect all the formatting errors, clean them up, and feed them in for a second pass. After this round I can use the cleanup procedures to process the other years’ data before geocoding.</p>
<p>Another tip I found for improving geocoding performance is to <strong>process one state at a time, perhaps sorting the addresses by zipcode</strong>, because I want the postgresql server to cache everything needed for geocoding in RAM and avoid disk access as much as possible. With limited RAM it’s better to process only similar addresses at a time. Splitting the huge data file into smaller tasks also makes it easier to find problems and deal with exceptions; of course, you will then need a good batch processing workflow to handle the many input files.</p>
<p>Someone also suggested to <strong>standardize the addresses first and remove the invalid ones</strong>, since they take the most time to geocode. However, I’m not sure how I can verify an address is valid without actually geocoding it. Some addresses are obviously missing street numbers and cannot have an exact location, but I may still need the ballpark location for my analysis; they may not be mappable to a census block, but a census tract mapping could still be helpful. After the first pass over one year’s data, I will design a much more complete cleaning process, which should make the geocoding function’s job a little easier.</p>
<p><a href="http://postgis.net/docs/postgis_installation.html#tiger_pagc_address_standardizing" target="_blank" rel="external">The PostGIS documentation</a> did mention that the built-in address normalizer is not optimal and they have a better pagc address standardizer can be used. I tried to enable it in the linux setup but failed. It seemed that I need to reinstall postgresql since it is not included in the postgresql setup process of the ansible playbook. The newer version PostGIS 2.2.0 released in Oct, 2015 seemed to have <em>“New high-speed native code address standardizer”</em>, while the ansible playbook used <code>PostgreSQL 9.3.10</code> and <code>PostGIS 2.1.2 r12389</code>. This is a direction I’ll explore later.</p>
<h2 id="test-geocoding-function"><a href="#Test-Geocoding-Function" class="headerlink" title="Test Geocoding Function"></a>Test Geocoding Function</h2><p>Based on the example given in <code>geocode</code> function documentation, I wrote my version of SQL command to geocode address like this:</p>
<figure class="highlight sql"><table><tr><td class="code"><pre><div class="line"><span class="keyword">SELECT</span> g.rating,</div><div class="line"> pprint_addy(g.addy),</div><div class="line"> ST_X(g.geomout)::<span class="built_in">numeric</span>(<span class="number">8</span>,<span class="number">5</span>) <span class="keyword">AS</span> lon,</div><div class="line"> ST_Y(g.geomout)::<span class="built_in">numeric</span>(<span class="number">8</span>,<span class="number">5</span>) <span class="keyword">AS</span> lat,</div><div class="line"> g.geomout</div><div class="line"><span class="keyword">FROM</span> geocode(<span class="string">'2198 Florida Ave NW, Washington, DC 20008'</span>, <span class="number">1</span>) <span class="keyword">AS</span> g;</div></pre></td></tr></table></figure>
<ul>
<li>the <code>1</code> parameter in the geocode function limits the output to the single address with the best rating, since we don’t have any other method to compare all the outputs.</li>
<li>rating is needed because I need to know the match score for the result: 0 is a perfect match, and 100 is a very rough match which I probably will not use.</li>
<li><code>pprint_addy</code> gives a pretty print of the address in a format people are familiar with.</li>
<li><code>geomout</code> is the point geometry of the match. I want to save this because it is a more precise representation and I may need it for census block mapping.</li>
<li><code>lon</code> and <code>lat</code> are the coordinates rounded to 5 digits after the decimal point. The 6th digit would be well within 1 m (one degree of latitude is about 111 km, so 0.000001 degree is about 0.1 m). Since most street address locations are interpolated and can be off a lot, there is no point in keeping more digits.</li>
</ul>
<p>The next step is to make it work for many rows instead of a single input. I formatted the addresses in R and wrote them to a csv file with this format:</p>
<table>
<thead>
<tr>
<th style="text-align:left">row_seq</th>
<th style="text-align:left">input_address</th>
<th style="text-align:center">zip</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:left">42203</td>
<td style="text-align:left">7365 RACE RD , HARMENS, MD 00000</td>
<td style="text-align:center">00000</td>
</tr>
<tr>
<td style="text-align:left">53948</td>
<td style="text-align:left">37 Parking Ramp , Washington, DC 20001</td>
<td style="text-align:center">20001</td>
</tr>
<tr>
<td style="text-align:left">229</td>
<td style="text-align:left">1315 5TH ST NW , WASHINGTON, DC 20001</td>
<td style="text-align:center">20001</td>
</tr>
<tr>
<td style="text-align:left">688</td>
<td style="text-align:left">1014 11TH ST NE , WASHINGTON, DC 20001</td>
<td style="text-align:center">20001</td>
</tr>
<tr>
<td style="text-align:left">2599</td>
<td style="text-align:left">100 RANDOLPH PL NW , WASHINGTON, DC 20001</td>
<td style="text-align:center">20001</td>
</tr>
</tbody>
</table>
<p>The <code>row_seq</code> is the unique id I assigned to every row so I can link the output back to the original table. <code>zip</code> is needed because I want to sort the addresses by zipcode; another bonus is that addresses with obviously wrong zipcodes are grouped together at the beginning or end of the file. I used the pipe symbol <code>|</code> as the csv separator because there could be quotes and commas within the columns.</p>
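<p>For example, with the pipe separator this row from the table above needs no quoting or escaping at all, even though its address contains commas:</p>
<pre><code>229|1315 5TH ST NW , WASHINGTON, DC 20001|20001
</code></pre>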
<p>Then I can read the csv into a table in the postgresql database. The <code>geocode</code> function documentation provides an example of geocoding addresses in batch mode, and most discussions on the web seem to be based on this example.</p>
<figure class="highlight sql"><table><tr><td class="code"><pre><div class="line"><span class="comment">-- only update the first 3 addresses (323-704 ms</span></div><div class="line"><span class="comment">-- there are caching and shared memory effects so first geocode you do is always slower)</span></div><div class="line"><span class="comment">-- for large numbers of addresses you don't want to update all at once</span></div><div class="line"><span class="comment">-- since the whole geocode must commit at once</span></div><div class="line"><span class="comment">-- For this example we rejoin with LEFT JOIN</span></div><div class="line"><span class="comment">-- and set to rating to -1 rating if no match</span></div><div class="line"><span class="comment">-- to ensure we don't regeocode a bad address</span></div><div class="line"><span class="keyword">UPDATE</span> addresses_to_geocode</div><div class="line"> <span class="keyword">SET</span> ( rating, new_address, lon, lat)</div><div class="line"> = ( <span class="keyword">COALESCE</span>((g.geo).rating,<span class="number">-1</span>), pprint_addy((g.geo).addy),</div><div class="line"> ST_X((g.geo).geomout)::<span class="built_in">numeric</span>(<span class="number">8</span>,<span class="number">5</span>), ST_Y((g.geo).geomout)::<span class="built_in">numeric</span>(<span class="number">8</span>,<span class="number">5</span>) )</div><div class="line"><span class="keyword">FROM</span> (<span class="keyword">SELECT</span> addid</div><div class="line"> <span class="keyword">FROM</span> addresses_to_geocode</div><div class="line"> <span class="keyword">WHERE</span> rating <span class="keyword">IS</span> <span class="literal">NULL</span> <span class="keyword">ORDER</span> <span class="keyword">BY</span> addid <span class="keyword">LIMIT</span> <span class="number">3</span>) <span class="keyword">AS</span> a</div><div class="line"> <span class="keyword">LEFT</span> <span class="keyword">JOIN</span> (<span class="keyword">SELECT</span> addid, (geocode(address,<span class="number">1</span>)) <span class="keyword">AS</span> geo</div><div class="line"> <span class="keyword">FROM</span> addresses_to_geocode <span class="keyword">AS</span> ag</div><div class="line"> <span class="keyword">WHERE</span> ag.rating <span class="keyword">IS</span> <span class="literal">NULL</span> <span class="keyword">ORDER</span> <span class="keyword">BY</span> addid <span class="keyword">LIMIT</span> <span class="number">3</span>) <span class="keyword">AS</span> g <span class="keyword">ON</span> a.addid = g.addid</div><div class="line"><span class="keyword">WHERE</span> a.addid = addresses_to_geocode.addid;</div></pre></td></tr></table></figure>
<p>Since the geocoding process can be slow, it’s suggested to process a small portion at a time. The address table has an <code>addid</code> assigned to each row as an index. The code always takes <em>the first 3 rows not yet processed (rating column is null)</em> as the <em>sample</em> <code>a</code> to be geocoded.</p>
<figure class="highlight sql"><table><tr><td class="code"><pre><div class="line"><span class="keyword">SELECT</span> addid</div><div class="line"> <span class="keyword">FROM</span> addresses_to_geocode</div><div class="line"> <span class="keyword">WHERE</span> rating <span class="keyword">IS</span> <span class="literal">NULL</span> <span class="keyword">ORDER</span> <span class="keyword">BY</span> addid <span class="keyword">LIMIT</span> <span class="number">3</span>) <span class="keyword">AS</span> a</div></pre></td></tr></table></figure>
<p><img src="table1.png" alt="table 1"><br>The <em>result of geocoding</em> <code>g</code> is joined with the <code>addid</code> of the <em>sample</em> <code>a</code>.</p>
<figure class="highlight sql"><table><tr><td class="code"><pre><div class="line">LEFT JOIN (<span class="keyword">SELECT</span> addid, (geocode(address,<span class="number">1</span>)) <span class="keyword">AS</span> geo</div><div class="line"> <span class="keyword">FROM</span> addresses_to_geocode <span class="keyword">AS</span> ag</div><div class="line"> <span class="keyword">WHERE</span> ag.rating <span class="keyword">IS</span> <span class="literal">NULL</span> <span class="keyword">ORDER</span> <span class="keyword">BY</span> addid <span class="keyword">LIMIT</span> <span class="number">3</span></div><div class="line"> ) <span class="keyword">AS</span> g <span class="keyword">ON</span> a.addid = g.addid</div></pre></td></tr></table></figure>
<p><img src="table2.png" alt="table 2"></p>
<p>Then the <code>address table</code> was joined with <em>that joined table a-g</em> by <code>addid</code>, and the corresponding columns were updated.</p>
<figure class="highlight sql"><table><tr><td class="code"><pre><div class="line"><span class="keyword">UPDATE</span> addresses_to_geocode</div><div class="line"> <span class="keyword">SET</span> ( rating, new_address, lon, lat)</div><div class="line"> = ( <span class="keyword">COALESCE</span>((g.geo).rating,<span class="number">-1</span>), pprint_addy((g.geo).addy),</div><div class="line"> ST_X((g.geo).geomout)::<span class="built_in">numeric</span>(<span class="number">8</span>,<span class="number">5</span>), ST_Y((g.geo).geomout)::<span class="built_in">numeric</span>(<span class="number">8</span>,<span class="number">5</span>) )</div><div class="line"><span class="keyword">FROM</span> </div><div class="line">...</div><div class="line"><span class="keyword">WHERE</span> a.addid = addresses_to_geocode.addid;</div></pre></td></tr></table></figure>
<p>The initial value of the rating column is <code>NULL</code>. A valid geocoding match has a rating ranging from 0 to around 100. Some inputs have no valid return value from the <code>geocode</code> function, which leaves the rating column <code>NULL</code>. The <code>COALESCE</code> function then replaces it with <code>-1</code> to separate these rows from the unprocessed ones, so that the next run can skip them. </p>
<p>The join of <code>a</code> and <code>g</code> may seem redundant at first since <code>g</code> already includes the <code>addid</code> column. However, when some rows have no match and no value is returned by the <code>geocode</code> function, <code>g</code> will only contain the rows that did return values.<br><img src="table3.png" alt="table 3"><br>Joining <code>g</code> alone with the address table would only update those rows by <code>addid</code>. The <code>COALESCE</code> function would have no effect since the <code>addid</code> of the empty rows is not even included. The next run would then select these rows again because they still satisfy the sample selection condition, which would mess up the control logic.</p>
<p>Instead, joining <code>a</code> and <code>g</code> keeps every <code>addid</code> in the sample, and the no-match rows have <code>NULL</code> in the rating column.<br><img src="table4.png" alt="table 4"><br>The next join with the address table then has the rating column updated correctly by the <code>COALESCE</code> function.<br><img src="table5.png" alt="table 5"></p>
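<p>A minimal self-contained sketch of why this works, with toy values in place of the real tables: the sample has 3 ids, but the geocoder only returned rows for two of them.</p>
<figure class="highlight"><table><tr><td class="code"><pre><div class="line">-- toy version of the a LEFT JOIN g pattern</div><div class="line">WITH a(addid) AS (VALUES (1), (2), (3)),</div><div class="line">     g(addid, rating) AS (VALUES (1, 10), (3, 25))</div><div class="line">SELECT a.addid, COALESCE(g.rating, -1) AS rating</div><div class="line">FROM a LEFT JOIN g ON a.addid = g.addid;</div><div class="line">-- returns (1,10), (2,-1), (3,25): id 2 is kept and marked as processed</div></pre></td></tr></table></figure>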
<p>This programming pattern was new to me. I think it exists because SQL doesn’t have the fine-grained control of regular procedural languages, yet we still need more control sometimes, so we end up with patterns like this.</p>
<h2 id="problem-with-ill-formated-address"><a href="#Problem-With-Ill-Formated-Address" class="headerlink" title="Problem With Ill Formated Address"></a>Problem With Ill Formated Address</h2><p>In my experiment with test data I found the example code above often had serious performance problems. It was very similar to another problem I observed: if I run this line with different table sizes, it should have similar performance since it is supposed to only process the first 3 rows.</p>
<figure class="highlight sql"><table><tr><td class="code"><pre><div class="line"><span class="keyword">SELECT</span> geocode(address_string,<span class="number">1</span>) </div><div class="line"> <span class="keyword">FROM</span> address_sample <span class="keyword">LIMIT</span> <span class="number">3</span>;</div></pre></td></tr></table></figure>
<p>Actually it took much, much longer on a larger table. It seemed to be geocoding the whole table first, then returning only the first 3 rows. If I subset the table more explicitly, the problem disappeared:</p>
<figure class="highlight sql"><table><tr><td class="code"><pre><div class="line"><span class="keyword">SELECT</span> geocode(sample.address_string, <span class="number">1</span>) </div><div class="line"> <span class="keyword">FROM</span> (<span class="keyword">SELECT</span> address_string </div><div class="line"> <span class="keyword">FROM</span> address_sample <span class="keyword">LIMIT</span> <span class="number">3</span></div><div class="line"> ) <span class="keyword">as</span> <span class="keyword">sample</span>;</div></pre></td></tr></table></figure>
<p>I modified the example code similarly. Instead of calling <code>geocode</code> in the same query that filters and limits the rows,</p>
<figure class="highlight sql"><table><tr><td class="code"><pre><div class="line"><span class="keyword">SELECT</span> addid, (geocode(address,<span class="number">1</span>)) <span class="keyword">AS</span> geo</div><div class="line"> <span class="keyword">FROM</span> addresses_to_geocode <span class="keyword">AS</span> ag</div><div class="line"> <span class="keyword">WHERE</span> ag.rating <span class="keyword">IS</span> <span class="literal">NULL</span> <span class="keyword">ORDER</span> <span class="keyword">BY</span> addid <span class="keyword">LIMIT</span> <span class="number">3</span></div></pre></td></tr></table></figure>
<p>I explicitly select the sample rows first and put that subquery in the <code>FROM</code> clause; problem solved.</p>
<figure class="highlight sql"><table><tr><td class="code"><pre><div class="line"><span class="keyword">SELECT</span> sample.addid, geocode(sample.input_address,<span class="number">1</span>) <span class="keyword">AS</span> geo</div><div class="line"> <span class="keyword">FROM</span> (<span class="keyword">SELECT</span> addid, input_address</div><div class="line"> <span class="keyword">FROM</span> address_table <span class="keyword">WHERE</span> rating <span class="keyword">IS</span> <span class="literal">NULL</span></div><div class="line"> <span class="keyword">ORDER</span> <span class="keyword">BY</span> addid <span class="keyword">LIMIT</span> sample_size</div><div class="line"> ) <span class="keyword">AS</span> <span class="keyword">sample</span></div></pre></td></tr></table></figure>
<p>Later I found this problem only occurs when the first row of the table has an invalid address, for which the <code>geocode</code> function returns no value. These are the <code>EXPLAIN ANALYZE</code> results from the pgAdmin SQL query tool:</p>
<p>The example code runs on a 100-row table for the first time, with the first row’s address invalid. The first step, a <code>Seq Scan</code>, takes 284 s (this was on my home pc server running on a regular hard drive with all states’ data, so the performance was bad) to return 99 rows of geocoding results (one row has no match).</p>
<p><img src="1_explain_scan_v1.png" alt="1. example code returned 99 rows in seq scan "></p>
<p><img src="2_explain_limit_v1.png" alt="2. example code limited results to 3 rows later"></p>
<p>My modified version, in contrast, only processed 3 rows in the first step.<br><img src="3_explain_scan_v2.png" alt="3. modified version geocoded 3 rows only"></p>
<p>After the first row has been processed and marked with <code>-1</code> in rating, the example code no longer has the problem.<br><img src="4_explain_scan_v1_2nd_run.png" alt="4. example code no longer has the problem with valid inputs"></p>
<p>If I moved the problematic row to the second row, there was no problem either. It seems the postgresql planner has trouble only when the first row doesn’t have a valid return value. The <code>geocode</code> function authors probably didn’t find this bug because it is a special case, but it’s very common in my data: because I sorted the addresses by zipcode, the many ill-formatted addresses with invalid zipcodes always appear at the beginning of the file.</p>
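<p>To inspect the plan yourself, you can run <code>EXPLAIN ANALYZE</code> on the inner query directly; this is the text form of what the pgAdmin tool shows graphically:</p>
<figure class="highlight"><table><tr><td class="code"><pre><div class="line">EXPLAIN ANALYZE</div><div class="line">SELECT addid, (geocode(address, 1)) AS geo</div><div class="line">  FROM addresses_to_geocode</div><div class="line">  WHERE rating IS NULL ORDER BY addid LIMIT 3;</div></pre></td></tr></table></figure>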
<h2 id="making-a-full-script"><a href="#Making-A-full-Script" class="headerlink" title="Making A full Script"></a>Making A full Script</h2><p>To have a better control of the whole process, I need some <a href="http://www.postgresql.org/docs/current/static/plpgsql-control-structures.html" target="_blank" rel="external">control structures</a> from PL/pgSQL - sql procedural Language.</p>
<p>First I wrapped the geocoding code in a <code>geocode_sample</code> function, with the sample size for each run as a parameter.</p>
<figure class="highlight sql"><table><tr><td class="code"><pre><div class="line"><span class="keyword">CREATE</span> <span class="keyword">OR</span> <span class="keyword">REPLACE</span> <span class="keyword">FUNCTION</span> geocode_sample(sample_size <span class="built_in">integer</span>) </div><div class="line"> <span class="keyword">RETURNS</span> <span class="built_in">void</span> <span class="keyword">AS</span> $$</div><div class="line"><span class="keyword">BEGIN</span></div><div class="line">...</div><div class="line"><span class="keyword">END</span>;</div><div class="line">$$ LANGUAGE plpgsql;</div></pre></td></tr></table></figure>
<p><code>Create or replace</code> makes debugging and changing the code easier because the new version will replace the existing one.</p>
<p>Then the main control function <code>geocode_table</code> calculates the number of rows in the whole table, decides how many sample runs are needed to update all of it, then runs the <code>geocode_sample</code> function in a loop that many times. I don’t want to use a conditional loop because, if something goes wrong, the code could get stuck at some point in an endless loop. I’d rather run the code a calculated number of times, then check the table to make sure all rows were processed correctly.</p>
<figure class="highlight sql"><table><tr><td class="code"><pre><div class="line"><span class="keyword">DROP</span> <span class="keyword">FUNCTION</span> <span class="keyword">IF</span> <span class="keyword">EXISTS</span> geocode_table();</div><div class="line"><span class="keyword">CREATE</span> <span class="keyword">OR</span> <span class="keyword">REPLACE</span> <span class="keyword">FUNCTION</span> geocode_table(</div><div class="line"> <span class="keyword">OUT</span> table_size <span class="built_in">integer</span>,</div><div class="line"> <span class="keyword">OUT</span> remaining_rows <span class="built_in">integer</span>) <span class="keyword">AS</span> $func$</div><div class="line"><span class="keyword">DECLARE</span> sample_size <span class="built_in">integer</span>;</div><div class="line"><span class="keyword">BEGIN</span></div><div class="line"> <span class="keyword">SELECT</span> reltuples::<span class="built_in">bigint</span> <span class="keyword">INTO</span> table_size</div><div class="line"> <span class="keyword">FROM</span> pg_class</div><div class="line"> <span class="keyword">WHERE</span> <span class="keyword">oid</span> = <span class="string">'public.address_table'</span>::regclass;</div><div class="line"> sample_size := 1;</div><div class="line"> FOR i IN 1..(<span class="keyword">SELECT</span> table_size / sample_size + <span class="number">1</span>) <span class="keyword">LOOP</span></div><div class="line"> PERFORM geocode_sample(sample_size);</div><div class="line"> <span class="keyword">END</span> <span class="keyword">LOOP</span>;</div><div class="line"> <span class="keyword">SELECT</span> <span class="keyword">count</span>(*) <span class="keyword">INTO</span> remaining_rows </div><div class="line"> <span class="keyword">FROM</span> address_table <span class="keyword">WHERE</span> rating <span class="keyword">IS</span> <span class="literal">NULL</span>;</div><div class="line"><span class="keyword">END</span></div><div class="line">$func$ <span class="keyword">LANGUAGE</span> plpgsql;</div></pre></td></tr></table></figure>
<ol>
<li>I used <code>drop function if exists</code> here because <code>Create or replace</code> doesn’t work if the function return type has changed.</li>
<li>It’s widely acknowledged that counting a table’s rows with <code>count(*)</code> is not optimal. The method I used should be much quicker as long as the table statistics are up to date (see the comparison sketch after this list). I used to put a <code>VACUUM ANALYZE</code> line after the table was constructed and the csv data imported, but on every run it reported that no update was needed, probably because the default postgresql settings already keep the statistics current in my case.</li>
<li>In the end I counted the rows not yet processed. The total row count and the remaining row count are the return values of this function.</li>
</ol>
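<p>For comparison, here are the two ways of getting the row count; the estimate reads a single catalog row, while the exact count scans the whole table:</p>
<figure class="highlight"><table><tr><td class="code"><pre><div class="line">-- fast estimate from table statistics (kept current by autovacuum/ANALYZE)</div><div class="line">SELECT reltuples::bigint FROM pg_class</div><div class="line">  WHERE oid = 'public.address_table'::regclass;</div><div class="line">-- exact, but scans the whole table</div><div class="line">SELECT count(*) FROM address_table;</div></pre></td></tr></table></figure>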
<p>The whole PL/pgSQL script is structured like this (<em>the actual details inside the functions are omitted for a clear view of the whole picture; see the complete scripts and everything else in <a href="https://github.com/dracodoc/Geocode" target="_blank" rel="external">my github repo</a></em>):</p>
<figure class="highlight sql"><table><tr><td class="code"><pre><div class="line"><span class="keyword">DROP</span> <span class="keyword">TABLE</span> <span class="keyword">IF</span> <span class="keyword">EXISTS</span> address_table;</div><div class="line"><span class="keyword">CREATE</span> <span class="keyword">TABLE</span> address_table(</div><div class="line"> row_seq <span class="built_in">varchar</span>(<span class="number">255</span>),</div><div class="line"> input_address <span class="built_in">varchar</span>(<span class="number">255</span>),</div><div class="line"> zip <span class="built_in">varchar</span>(<span class="number">255</span>) </div><div class="line">);</div><div class="line"><span class="comment">-- aws version.</span></div><div class="line">COPY address_table FROM :input_file WITH DELIMITER '|' NULL 'NA' CSV HEADER;</div><div class="line"><span class="comment">-- pc version.</span></div><div class="line"><span class="comment">-- COPY address_table FROM 'e:\\Data\\1.csv' WITH DELIMITER ',' NULL 'NA' CSV HEADER;</span></div><div class="line"></div><div class="line"><span class="keyword">ALTER</span> <span class="keyword">TABLE</span> address_table</div><div class="line"> <span class="keyword">ADD</span> addid <span class="built_in">serial</span> <span class="keyword">NOT</span> <span class="literal">NULL</span> PRIMARY <span class="keyword">KEY</span>,</div><div class="line"> <span class="keyword">ADD</span> rating <span class="built_in">integer</span>, </div><div class="line"> <span class="keyword">ADD</span> lon <span class="built_in">numeric</span>,</div><div class="line"> <span class="keyword">ADD</span> lat <span class="built_in">numeric</span>,</div><div class="line"> <span class="keyword">ADD</span> output_address <span class="built_in">text</span>,</div><div class="line"> <span class="keyword">ADD</span> geomout geometry, <span class="comment">-- a point geometry in NAD 83 long lat.</span></div><div class="line"></div><div class="line"><span class="comment">--<< geocode function --</span></div><div class="line"><span class="keyword">CREATE</span> <span class="keyword">OR</span> <span class="keyword">REPLACE</span> <span class="keyword">FUNCTION</span> geocode_sample(sample_size <span class="built_in">integer</span>) </div><div class="line"> <span class="keyword">RETURNS</span> <span class="built_in">void</span> <span class="keyword">AS</span> $$</div><div class="line">...</div><div class="line"><span class="keyword">END</span>;</div><div class="line">$$ LANGUAGE plpgsql;</div><div class="line"><span class="comment">-- geocode function >>--</span></div><div class="line"></div><div class="line"><span class="comment">--<< main control --</span></div><div class="line"><span class="keyword">DROP</span> <span class="keyword">FUNCTION</span> <span class="keyword">IF</span> <span class="keyword">EXISTS</span> geocode_table();</div><div class="line"><span class="keyword">CREATE</span> <span class="keyword">OR</span> <span class="keyword">REPLACE</span> <span class="keyword">FUNCTION</span> geocode_table(</div><div class="line"> <span class="keyword">OUT</span> table_size <span class="built_in">integer</span>,</div><div class="line"> <span class="keyword">OUT</span> remaining_rows <span class="built_in">integer</span>) <span class="keyword">AS</span> $func$</div><div class="line">...</div><div class="line"><span class="keyword">END</span></div><div class="line">$func$ <span class="keyword">LANGUAGE</span> plpgsql;</div><div class="line"><span class="comment">-- main 
control >>--</span></div><div class="line"></div><div class="line"><span class="keyword">SELECT</span> * <span class="keyword">FROM</span> geocode_table();</div></pre></td></tr></table></figure>
<ol>
<li>First I drop the address table if it previously exists, then create the table with columns of character type, because I don’t want the leading zeros in zipcodes lost by converting to integer.</li>
<li>I have two versions of importing the csv into the table, one for testing on a windows pc, another for the AWS linux instance. The SQL <code>copy</code> command needs the postgresql server user to have permission on the input file, so you need to make sure the folder permission is correct. The linux version uses a parameter for the input file path.</li>
<li>Then the necessary columns were added to the table and the index was built.</li>
<li>The last line runs the main control function and prints its return value at the end, which is the total row count and the remaining row count of the input table.</li>
</ol>
<h2 id="intersection-address"><a href="#Intersection-address" class="headerlink" title="Intersection address"></a>Intersection address</h2><p>Another type of input is intersections. Tiger Geocoder have a function <a href="http://postgis.net/docs/Geocode_Intersection.html" target="_blank" rel="external"><code>Geocode_Intersection</code></a> work like this:</p>
<figure class="highlight sql"><table><tr><td class="code"><pre><div class="line"><span class="keyword">SELECT</span> pprint_addy(addy), st_astext(geomout), rating</div><div class="line"> <span class="keyword">FROM</span> geocode_intersection( <span class="string">'Haverford St'</span>,<span class="string">'Germania St'</span>, <span class="string">'MA'</span>, <span class="string">'Boston'</span>, <span class="string">'02130'</span>,<span class="number">1</span>);</div></pre></td></tr></table></figure>
<p>It takes two street names, state, city and zipcode, then outputs multiple location candidates with ratings. The script for geocoding street addresses only needs some minor changes to the input table column format and function parameters to work on intersections. I’ll post the finished whole script for reference after all the discussions.</p>
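<p>As a sketch, the inner sample query of the geocoding script would change roughly like this for intersections. The table name <code>intersection_table</code> is hypothetical; the columns follow the example record shown further below.</p>
<figure class="highlight"><table><tr><td class="code"><pre><div class="line">-- hypothetical input table; same sampling pattern as the street address script</div><div class="line">SELECT sample.addid,</div><div class="line">       geocode_intersection(sample.street_1, sample.street_2,</div><div class="line">           sample.state, sample.city, sample.zip, 1) AS geo</div><div class="line">  FROM (SELECT addid, street_1, street_2, state, city, zip</div><div class="line">          FROM intersection_table WHERE rating IS NULL</div><div class="line">          ORDER BY addid LIMIT 3) AS sample;</div></pre></td></tr></table></figure>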
<h2 id="map-to-census-block"><a href="#Map-to-Census-Block" class="headerlink" title="Map to Census Block"></a>Map to Census Block</h2><p>One important goal of my project is to map addresses to census block, then we can link the NFIRS data with other public data and produce much more powerful analysis, especially the <a href="http://www.census.gov/programs-surveys/ahs.html" target="_blank" rel="external">American Housing Survey(AHS)</a> and the <a href="https://www.census.gov/programs-surveys/acs/" target="_blank" rel="external">American Community Survey(ACS)</a>.</p>
<p>There is a <a href="http://postgis.net/docs/Get_Tract.html" target="_blank" rel="external"><code>Get_Tract</code> function</a> in Tiger Geocoder which returns the <em>census tract</em> id for a location. For <em>census block</em> mapping, people seem to just use <a href="http://postgis.org/docs/ST_Contains.html" target="_blank" rel="external">ST_Contains</a>, as in <a href="http://gis.stackexchange.com/questions/137870/finding-census-block-for-given-address-using-tiger-geocoder" target="_blank" rel="external">this answer</a> on stackexchange:</p>
<figure class="highlight sql"><table><tr><td class="code"><pre><div class="line"><span class="keyword">SELECT</span> tabblock_id <span class="keyword">AS</span> <span class="keyword">Block</span>,</div><div class="line"> <span class="keyword">substring</span>(tabblock_id <span class="keyword">FROM</span> <span class="number">1</span> <span class="keyword">FOR</span> <span class="number">11</span>) <span class="keyword">AS</span> Blockgroup,</div><div class="line"> <span class="keyword">substring</span>(tabblock_id <span class="keyword">FROM</span> <span class="number">1</span> <span class="keyword">FOR</span> <span class="number">9</span>) <span class="keyword">AS</span> Tract,</div><div class="line"> <span class="keyword">substring</span>(tabblock_id <span class="keyword">FROM</span> <span class="number">1</span> <span class="keyword">FOR</span> <span class="number">5</span>) <span class="keyword">AS</span> County,</div><div class="line"> <span class="keyword">substring</span>(tabblock_id <span class="keyword">FROM</span> <span class="number">1</span> <span class="keyword">FOR</span> <span class="number">2</span>) <span class="keyword">AS</span> STATE</div><div class="line"><span class="keyword">FROM</span> tabblock</div><div class="line"><span class="keyword">WHERE</span> ST_Contains(the_geom, ST_SetSRID(ST_Point(<span class="number">-71.101375</span>, <span class="number">42.31376</span>), <span class="number">4269</span>))</div></pre></td></tr></table></figure>
<p>The national data loaded by Tiger Geocoder includes a table <code>tabblock</code> which holds the information on census blocks. <code>ST_Contains</code> tests the spatial relationship between two geometries; in our case, whether the polygon or multipolygon of a census block contains the point of interest. The <code>WHERE</code> clause selects the single record that satisfies this condition for the point.</p>
<p>The census block id is a 15-digit code constructed from the state and county fips codes, the census tract id, the blockgroup id and the census block number (see the decomposition sketch after the list below). The code example above is actually not ideal for me since it includes all the prefixes in each column. My code works on the results from the geocoding script above:</p>
<figure class="highlight sql"><table><tr><td class="code"><pre><div class="line"><span class="keyword">UPDATE</span> address_table</div><div class="line"> <span class="keyword">SET</span> (tabblock_id, STATE, county, tractid)</div><div class="line"> = (<span class="keyword">COALESCE</span>(ab.tabblock_id,<span class="string">'FFFF'</span>),</div><div class="line"> <span class="keyword">substring</span>(ab.tabblock_id <span class="keyword">FROM</span> <span class="number">1</span> <span class="keyword">FOR</span> <span class="number">2</span>),</div><div class="line"> <span class="keyword">substring</span>(ab.tabblock_id <span class="keyword">FROM</span> <span class="number">3</span> <span class="keyword">FOR</span> <span class="number">3</span>),</div><div class="line"> <span class="keyword">substring</span>(ab.tabblock_id <span class="keyword">FROM</span> <span class="number">1</span> <span class="keyword">FOR</span> <span class="number">11</span>)</div><div class="line"> )</div><div class="line"><span class="keyword">FROM</span></div><div class="line"> (<span class="keyword">SELECT</span> addid</div><div class="line"> <span class="keyword">FROM</span> address_table</div><div class="line"> <span class="keyword">WHERE</span> (geomout <span class="keyword">IS</span> <span class="keyword">NOT</span> <span class="literal">NULL</span>) <span class="keyword">AND</span> (tabblock_id <span class="keyword">IS</span> <span class="literal">NULL</span>)</div><div class="line"> <span class="keyword">ORDER</span> <span class="keyword">BY</span> addid <span class="keyword">LIMIT</span> block_sample_size) <span class="keyword">AS</span> a</div><div class="line"> <span class="keyword">LEFT</span> <span class="keyword">JOIN</span> (<span class="keyword">SELECT</span> a.addid, b.tabblock_id</div><div class="line"> <span class="keyword">FROM</span> address_table <span class="keyword">AS</span> a, tabblock <span class="keyword">AS</span> b</div><div class="line"> <span class="keyword">WHERE</span> (geomout <span class="keyword">IS</span> <span class="keyword">NOT</span> <span class="literal">NULL</span>) <span class="keyword">AND</span> (a.tabblock_id <span class="keyword">IS</span> <span class="literal">NULL</span>)</div><div class="line"> <span class="keyword">AND</span> ST_Contains(b.the_geom, ST_SetSRID(ST_Point(a.lon, a.lat), <span class="number">4269</span>))</div><div class="line"> <span class="keyword">ORDER</span> <span class="keyword">BY</span> addid <span class="keyword">LIMIT</span> block_sample_size) <span class="keyword">AS</span> ab <span class="keyword">ON</span> a.addid = ab.addid</div><div class="line"><span class="keyword">WHERE</span> a.addid = address_table.addid;</div></pre></td></tr></table></figure>
<ul>
<li>I didn’t include the state fips as a prefix in the county fips since, strictly speaking, the county fips is 3 digits, although you always need to use it together with the state fips. I included the census tract because some locations may be ambiguous, but the census tract will most likely be the same.</li>
<li>This code is based on the same principle as the geocoding code, with a few changes:<ul>
<li>It needs to work on top of the geocoding results, so the sample for each run is the rows that have been geocoded (thus the geomout column is not <code>NULL</code>) but not yet mapped to a census block (<code>tabblock_id</code> is <code>NULL</code>), sorted by <code>addid</code> and limited by the sample size.</li>
<li>Similar to the geocoding code, I need to join the sample <code>addid</code> with the lookup result to make sure even the rows without a return value are included in the result. The <code>NULL</code> rating values of those rows are then replaced with a special value to mark the rows as already processed but without a match. This step is critical for the updating process to work properly.</li>
</ul>
</li>
</ul>
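<p>For reference, here is how the 15-digit block id decomposes, using the DC census block that appears in the intersection example below. The digit widths follow the standard census GEOID convention, which also matches the substring positions in my code above (2 state + 3 county + 6 tract + 4 block, with the block group being the first block digit):</p>
<pre><code>110010055001010
11      state (DC)
001     county
005500  census tract
1010    block (block group = leading digit 1)
</code></pre>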
<p>In theory this mapping is much easier than geocoding since there is not much ambiguity, and every address should belong to some census block. Actually I found <a href="http://gis.stackexchange.com/questions/170217/find-census-block-for-street-intersection-with-tiger-geocoder" target="_blank" rel="external">many street intersections don’t have matches</a>. I tested the same address on <a href="http://geocoding.geo.census.gov/geocoder/" target="_blank" rel="external">the official Census website</a> and it found the match! </p>
<p>Here is the example data I used; the <code>geocode_intersection</code> function returned a street address and coordinates from the two streets:</p>
<pre><code>row_seq | 2716
street_1 | FLORIDA AVE NW
street_2 | MASSACHUSETTS AVE NW
state | DC
city | WASHINGTON
zip | 20008
addid | 21
rating | 3
lon | -77.04879
lat | 38.91150
output_address | 2198 Florida Ave NW, Washington, DC 20008
</code></pre><p>I used different test methods and found interesting results:</p>
<table>
<thead>
<tr>
<th style="text-align:left">input</th>
<th style="text-align:left">method</th>
<th style="text-align:left">result</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:left">2 streets</td>
<td style="text-align:left">geocode_intersection</td>
<td style="text-align:left">(-77.04879, 38.91150)</td>
</tr>
<tr>
<td style="text-align:left">geocode_intersection output address</td>
<td style="text-align:left">geocode</td>
<td style="text-align:left">(-77.04871, 38.91144)</td>
</tr>
<tr>
<td style="text-align:left">geocode_intersection output address</td>
<td style="text-align:left">Census website</td>
<td style="text-align:left">(-77.048775,38.91151) GEOID: 110010055001010</td>
</tr>
<tr>
<td style="text-align:left">geocode_intersection coordinates, 5 digits</td>
<td style="text-align:left">Census website</td>
<td style="text-align:left">census block GEOID: 110010041003022</td>
</tr>
<tr>
<td style="text-align:left">geocode_intersection coordinates, 5 digits</td>
<td style="text-align:left">Tiger Geocoder</td>
<td style="text-align:left">census block GEOID: 110010041003022</td>
</tr>
<tr>
<td style="text-align:left">geocode_intersection coordinates, 6 digits</td>
<td style="text-align:left">Tiger Geocoder</td>
<td style="text-align:left">census block: no match</td>
</tr>
</tbody>
</table>
<ul>
<li>If I feed the street address output from <code>geocode_intersection</code> back to the <code>geocode</code> function, the output coordinates differ slightly from the coordinates output by <code>geocode_intersection</code>. My theory is that the <code>geocode_intersection</code> function first calculates the intersection point from the geometry of the two streets, then reverse geocodes those coordinates into a street address. The street number is usually interpolated, so if you geocode that street address back to coordinates there can be a difference. <strong>Update</strong>: <a href="http://gis.stackexchange.com/a/115666" target="_blank" rel="external">Some interesting background information about street address locations and ranges</a>.</li>
<li>The slight difference may result in a different census block, probably because these locations are street intersections, which are more than likely to lie on census block boundaries.</li>
<li>Using the geometry or the coordinate output (6 digits after the decimal point) from <code>geocode_intersection</code> in <code>ST_Contains</code> can return an empty result, i.e. no census block has a containment relationship with these points. I’m not sure of the reason; I only observed that coordinates with 5 digits after the decimal point find a match most of the time (see the sketch after this list). This is an open question that needs consulting with the experts.</li>
</ul>
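<p>The workaround I observed, as a sketch using the coordinates from the table above: with the 6-digit coordinates this containment test returned no row for me, while the values rounded to 5 digits find the block.</p>
<figure class="highlight"><table><tr><td class="code"><pre><div class="line">-- coordinates rounded to 5 digits after the decimal point</div><div class="line">SELECT tabblock_id FROM tabblock</div><div class="line">  WHERE ST_Contains(the_geom,</div><div class="line">    ST_SetSRID(ST_Point(-77.04879, 38.91150), 4269));</div><div class="line">-- returns 110010041003022, as in the table above</div></pre></td></tr></table></figure>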
<h2 id="work-in-batch"><a href="#Work-In-Batch" class="headerlink" title="Work In Batch"></a>Work In Batch</h2><p>I was planning to geocode addresses by states to improve the performance, so I’ll need to process lots of files. After some experimentations, I developed a batch workflow:</p>
<ol>
<li><p>The script discussed above can take a csv input, geocode the addresses, map the census blocks and update the table. I used this psql command line to execute the script. Note I have a .pgpass file in my user folder so I don’t need to write the database password on the command line, and I saved a copy of the console messages to a log file. </p>
<pre><code>psql -d census -U postgres -h localhost -w -v input_file="'/home/ubuntu/geocode/address_input/address_sample.csv'" -f geocode_batch.sql 2>&1 | tee address.log
</code></pre></li>
<li><p>I need to save the result table to csv. The SQL <code>Copy</code> requires the postgresql user to have permission on the output file, so I used the psql meta command <code>\copy</code> instead. It can be written inside the PL/pgSQL script, but I could not make it use a parameter as the output file name, so I had to write another psql command line:</p>
<pre><code>psql -d census -U postgres -h localhost -w -c '\copy address_table to /home/ubuntu/geocode/address_output/1.csv csv header'
</code></pre></li>
<li><p>The above two lines take care of one input file. If I put all the input files into one folder, I can generate a shell script to process each input file with the above command lines. At first I tried to use a shell script directly to read the file names and loop over them, but it became very cumbersome and error prone, because I want to generate the <em>output file</em> name dynamically from the <em>input file</em> name and pass them as psql command line parameters. I ended up with a simple python script that generates the shell script I wanted. </p>
<p> Before running the shell script I need to change the permission:</p>
<pre><code>chmod +x ./batch.sh
sh ./batch.sh
</code></pre></li>
</ol>
<h2 id="exception-handling-and-progress-report"><a href="#Exception-Handling-And-Progress-Report" class="headerlink" title="Exception Handling And Progress Report"></a>Exception Handling And Progress Report</h2><p>The NFIRS data have many ill formated addresses that could cause problem for <code>geocode</code> function. I decided that it’s better to process one year’s data first, then collect all the problem cases and design a cleaning procedure before processing other years’ data. </p>
<p>This means the workflow should be able to skip on errors and mark the problems. The script above can handle the case when no match is returned from the <code>geocode</code> function, but any exception at runtime will interrupt the script. Since <code>geocode_sample</code> is called in a loop inside the main control function, the whole script is one single transaction. Once the transaction is interrupted, it is rolled back and all the previous geocoding results are lost. See <a href="http://www.postgresql.org/docs/current/static/plpgsql-structure.html" target="_blank" rel="external">more about this</a>. </p>
<p>However, <a href="http://www.postgresql.org/docs/current/static/plpgsql-control-structures.html#PLPGSQL-ERROR-TRAPPING" target="_blank" rel="external">adding an EXCEPTION clause</a> effectively forms a subtransaction that can be rolled back without affecting the outer transaction.</p>
<p>Therefore I added this exception handling part to the <code>geocode_sample</code> function:</p>
<figure class="highlight"><table><tr><td class="code"><pre><div class="line">CREATE OR REPLACE FUNCTION geocode_sample(sample_size integer) </div><div class="line"> RETURNS void AS $$</div><div class="line">DECLARE OUTPUT address_table%ROWTYPE; </div><div class="line">BEGIN</div><div class="line">...</div><div class="line">EXCEPTION</div><div class="line">WHEN OTHERS THEN</div><div class="line"> SELECT * INTO OUTPUT </div><div class="line"> FROM address_table </div><div class="line"> WHERE rating IS NULL ORDER BY addid LIMIT 1;</div><div class="line"> RAISE NOTICE '<address error> in samples started from: %', OUTPUT;</div><div class="line"> RAISE notice '-- !!! % % !!!--', SQLERRM, SQLSTATE;</div><div class="line"> UPDATE address_table</div><div class="line"> SET rating = -2</div><div class="line"> FROM (SELECT addid</div><div class="line"> FROM address_table </div><div class="line"> WHERE rating IS NULL ORDER BY addid LIMIT sample_size</div><div class="line"> ) AS sample</div><div class="line"> WHERE sample.addid = address_table.addid;</div><div class="line">END;</div><div class="line">$$ LANGUAGE plpgsql;</div></pre></td></tr></table></figure>
<p>This code catches any exception, prints the first row of the current sample to show where the error occurred, and also prints the original exception message. </p>
<pre><code>psql:geocode_batch.sql:179: NOTICE: <address error> in samples started from: (1501652," RIVER (AT BLOUNT CO) (140 , KNOXVILLE, TN 37922",37922,27556,,,,,,,,,)
CONTEXT: SQL statement "SELECT geocode_sample(sample_size)"
PL/pgSQL function geocode_table() line 24 at PERFORM
psql:geocode_batch.sql:179: NOTICE: -- !!! invalid regular expression: parentheses () not balanced 2201B !!!--
</code></pre><p>To make sure the script continues to work on the remaining rows, it also sets the rating column of the current sample to <code>-2</code>, so these rows will be skipped in later runs. </p>
<p>One catch of this method is that the whole sample is skipped even if only one row in it caused the problem, so I may need to check them again after one pass. However, I didn’t find a better way to locate the row that caused the exception, other than setting up a marker for every row and keeping it updated. Instead, I tested the performance with different sample sizes, i.e. how many rows the <code>geocode_sample</code> function processes in one run. It turned out sample size 1 has no obvious performance penalty, maybe because the extra cost of a small sample is negligible compared to the cost of the geocoding function itself. With a sample size of 1 the exception handling code always marks only the problematic row, and the code is much simpler.</p>
<p>Another important feature I want is a progress report. If I split the NFIRS data by state, one state’s data often has tens of thousands of rows and takes several hours to finish. I don’t want to wait until it finishes to discover errors or problems. So I added a progress report like this:</p>
<pre><code>psql:geocode_batch.sql:178: NOTICE: > 2015-11-18 20:26:51+00 : Start on table of 10845
psql:geocode_batch.sql:178: NOTICE: > time passed | address processed <<<< address left
psql:geocode_batch.sql:178: NOTICE: > 00:00:54.3 | 100 <<<< 10745
psql:geocode_batch.sql:178: NOTICE: > 00:00:21.7 | 200 <<<< 10645
</code></pre><p>First it reports the size of the whole table, then the time taken for every 100 rows processed and how many rows are left. It’s pretty obvious in the above example that the first 100 rows took more time; that’s because many addresses with ill-formatted zipcodes were sorted to the top.</p>
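<p>A sketch of how such a report can be raised from the main loop, assuming <code>sample_size</code> is 1 so the loop counter equals the rows processed (the variable <code>last_report</code> is hypothetical; the full version is in the repo):</p>
<figure class="highlight"><table><tr><td class="code"><pre><div class="line">-- inside the geocode_table loop; last_report is a timestamptz variable</div><div class="line">-- initialized with clock_timestamp() before the loop</div><div class="line">IF i % 100 = 0 THEN</div><div class="line">  RAISE NOTICE '> % | % <<<< %',</div><div class="line">    to_char(clock_timestamp() - last_report, 'HH24:MI:SS.MS'),</div><div class="line">    i, table_size - i;</div><div class="line">  last_report := clock_timestamp();</div><div class="line">END IF;</div></pre></td></tr></table></figure>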
<p>Similarly, the mapping to census blocks has a progress report:</p>
<pre><code>psql:geocode_batch.sql:178: NOTICE: ==== start mapping census block ====
psql:geocode_batch.sql:178: NOTICE: # time passed | address to block <<<< address left
psql:geocode_batch.sql:178: NOTICE: # 00:00:02.6 | 1000 <<<< 9845
psql:geocode_batch.sql:178: NOTICE: # 00:00:03.4 | 2000 <<<< 8845
</code></pre><h2 id="summary-and-open-questions"><a href="#Summary-And-Open-Questions" class="headerlink" title="Summary And Open Questions"></a>Summary And Open Questions</h2><p><strong>I put everything in <a href="https://github.com/dracodoc/Geocode" target="_blank" rel="external">this Github repository</a></strong>. </p>
<p>My script has processed almost one year’s data, but I’m not really satisfied with the performance yet. When I tested the 44185 MD and DC addresses on the AWS free tier server with the MD, DC database, the average time per row was about 60 ms, while the full server with all states averaged 342 ms per row. Some other states with more ill-formatted addresses had worse performance. </p>
<p>I have updated the Tiger database index and tuned the postgresql configuration. I could try running in parallel, but the cpu should not be the bottleneck here, and <a href="http://geeohspatial.blogspot.com/2013/12/a-simple-function-for-parallel-queries_18.html" target="_blank" rel="external">the hack I found to make postgresql run queries in parallel</a> is not easily manageable. Somebody also mentioned partitioning the database, but I’m not sure if this would help.</p>
<p>And here are some open questions I will ask in the PostGIS community; some of them may have the potential to further improve performance:</p>
<ol>
<li><p>Why is a server with 2 states’ data much faster than the server with all states’ data? I assume it’s because a bad address that doesn’t have an exact hit at first costs much more time when the geocoder checks all states; with only 2 states this search is limited and stops much earlier. This could be further verified by comparing the performance of two test cases on each server: one with exact-match perfect addresses, another with lots of invalid addresses.</p>
<p> There is a <code>restrict_region</code> parameter of the <code>geocode</code> function that looks promising if it can limit the search range, since I have enough information, or reason to believe, that the state information is correct. I wrote a query trying to use one state’s geometry as the limiting parameter:</p>
<figure class="highlight sql"><table><tr><td class="code"><pre><div class="line"><span class="keyword">SELECT</span> geocode(<span class="string">'501 Fairmount DR , Annapolis, MD 20137'</span>, <span class="number">1</span>, the_geom) </div><div class="line"> <span class="keyword">FROM</span> tiger.state <span class="keyword">WHERE</span> statefp = <span class="string">'24'</span>;</div></pre></td></tr></table></figure>
<p> and compared the performance with the simple version</p>
<figure class="highlight sql"><table><tr><td class="code"><pre><div class="line"><span class="keyword">SELECT</span> geocode(<span class="string">'501 Fairmount DR , Annapolis, MD 20137'</span>,<span class="number">1</span>);</div></pre></td></tr></table></figure>
<p> I didn’t find a performance gain with the parameter. Instead it lost the performance gain from caching, which usually comes from running the same query immediately again because all the needed data has been cached in RAM. </p>
<p> Maybe my usage is not proper, or this parameter is not intended to work as I expected. However, if the search range could be limited, the performance gain could be substantial.</p>
</li>
<li><p>Will normalizing addresses first improve performance? I don’t think it will help unless I can filter bad addresses and remove them from the input entirely, which may not be possible for my usage of the NFIRS data. The new PostGIS 2.2.0 looks promising, but the ansible playbook is not updated yet, and I haven’t had the chance to set up the server again by myself.</p>
<p> One possible improvement to my workflow is to try to separate badly formatted addresses from the good ones. I already separated some of them by sorting by zipcode, but some addresses with a valid zipcode are obviously incomplete. The most important reason for separating all input by state is to have the server cache all the needed data in RAM. If the server meets some badly formatted addresses in the middle of the table and starts to look up all states, the already loaded whole-state cache could be disturbed; the good addresses would then need the geocoder to read the state data from the hard drive again. If cache update statistics could be summarized from the server log, this theory could be verified.</p>
<p> I’ve almost finished one year’s data. After it finishes I’ll design more clean-up procedures, and maybe move all suspicious addresses out to make sure the geocoding of the better shaped addresses is not interrupted.</p>
</li>
<li><p>Will replacing the default normalizing function with the <a href="http://postgis.net/docs/Address_Standardizer.html" target="_blank" rel="external">Address Standardizer</a> help? I didn’t find the normalizing step too time consuming in my experiments. However, if it can produce better formatted addresses from bad input, that could help the geocoding process.</p>
</li>
<li>Why do the 6-digit coordinates output for street intersections often have no matching census block, while coordinates rounded to 5 digits find a match most of the time?</li>
</ol>
<h2 id="version-history"><a href="#Version-History" class="headerlink" title="Version History"></a>Version History</h2><ul>
<li>2015-11-19 : First version.</li>
<li>2016-05-11 : Added Summary.</li>
<li>2016-08-19 : Syntax highlighting.</li>
</ul>
]]></content>
<summary type="html">
<h2 id="Summary"><a href="#Summary" class="headerlink" title="Summary"></a>Summary</h2><ul>
<li>I discussed all the problem I met, approaches I tried, and improvement I achieved in the Geocoding task.</li>
<li>There are many subtle details, some open questions and areas can be improved.</li>
<li>The final working script and complete workflow are hosted in <a href="https://github.com/dracodoc/Geocode" target="_blank" rel="external">github</a>.</li>
</ul>
</summary>
<category term="Geocoding" scheme="https://dracodoc.github.io/categories/Geocoding/"/>
<category term="Geocoding" scheme="https://dracodoc.github.io/tags/Geocoding/"/>
<category term="Tiger Geocoder" scheme="https://dracodoc.github.io/tags/Tiger-Geocoder/"/>
<category term="PostGIS" scheme="https://dracodoc.github.io/tags/PostGIS/"/>
<category term="postgresql" scheme="https://dracodoc.github.io/tags/postgresql/"/>
<category term="DataKind" scheme="https://dracodoc.github.io/tags/DataKind/"/>
<category term="NFIRS" scheme="https://dracodoc.github.io/tags/NFIRS/"/>
</entry>
<entry>
<title>Geocoding 18 million addresses with PostGIS Tiger Geocoder</title>
<link href="https://dracodoc.github.io/2015/11/17/Geocoding/"/>
<id>https://dracodoc.github.io/2015/11/17/Geocoding/</id>
<published>2015-11-17T16:36:10.000Z</published>
<updated>2016-08-19T13:46:40.772Z</updated>
<content type="html"><![CDATA[<h2 id="summary"><a href="#Summary" class="headerlink" title="Summary"></a>Summary</h2><ul>
<li>This post discussed the background, approaches, windows and linux environment setup for my Geocoding task.</li>
<li>See more details about the script and workflow in next post.</li>
</ul>
<a id="more"></a>
<h2 id="background"><a href="#Background" class="headerlink" title="Background"></a>Background</h2><p>I found I want to geocode lots of addresses in my <a href="http://dracodoc.github.io/2015/11/11/Red-Cross-Smoke-Alarm-Project/">Red Cross Smoke Alarm Project</a>. The <a href="https://github.com/brooksandrew/arc_smoke_alarm/wiki/References-and-Data-Sources#public-data-sources" target="_blank" rel="external">NFIRS data</a> have 18 million addresses in 9 years data, and I would like to </p>
<ul>
<li>verify all the addresses because many inputs have quality problems.</li>
<li>map street addresses to coordinates, so we can map them and do more geospatial analysis.</li>
<li>map street addresses to census blocks, so we can link NFIRS data to other public data like the census data of the <a href="https://www.census.gov/programs-surveys/acs/" target="_blank" rel="external">American Community Survey (ACS)</a> and the <a href="http://www.census.gov/programs-surveys/ahs.html" target="_blank" rel="external">American Housing Survey (AHS)</a>.</li>
</ul>
<h2 id="possible-approaches"><a href="#Possible-Approaches" class="headerlink" title="Possible Approaches"></a>Possible Approaches</h2><p>I did some research on the possible options:</p>
<ul>
<li>Online services. Most free online APIs have limits, and a paid service would be too expensive for my task. Surprisingly, the FCC has <a href="https://www.fcc.gov/developers/census-block-conversions-api" target="_blank" rel="external">an API</a> to map coordinates to census blocks that doesn’t mention a limit, but it cannot geocode street addresses to coordinates.</li>
<li><a href="http://www.tigergeocoder.com/" target="_blank" rel="external">This company</a> provide service in Amazon EC2 for a fee. They have <a href="https://github.com/bibanul/tiger-geocoder/wiki/Running-your-own-Geocoder-in-Amazon-EC2" target="_blank" rel="external">more information about their setup in github</a>. What I did is actually a similar approach but in a totally DIY way.</li>
<li>Set up your own geocoder. <a href="http://postgis.net/docs/Extras.html#Tiger_Geocoder" target="_blank" rel="external">Tiger geocoder</a> is a PostGIS extension which uses Census Tiger data to geocode addresses.</li>
</ul>
<p>PostGIS works on both windows and linux, and Enigma.io has shared their <a href="https://github.com/enigma-io/ansible-tiger-geocoder-playbook" target="_blank" rel="external">automated Tiger Geocoder setup tool</a> for linux. However, the Tiger database itself needs 105G of space and I don’t have a linux box for that (the Amazon AWS free tier only allows 30G of storage), so I decided to install PostGIS on windows and experiment with everything first.</p>
<h2 id="windows-setup"><a href="#Windows-Setup" class="headerlink" title="Windows Setup"></a>Windows Setup</h2><p>I need to install postgresql server, PostGIS extension and Tiger geocoder extension. <a href="http://www.bostongis.com/?content_name=postgis_tut01" target="_blank" rel="external">This</a> is a very detailed installation guide for PostGIS in windows. I’ll just add some notes from my experience:</p>
<ul>
<li>It’s best if you can install the database on an SSD drive. My first setup was on an SSD with only two states’ data, and the geocoding performance was pretty good. Then I needed to download all the states, so I had to move the database to a regular hard drive according to <a href="https://wiki.postgresql.org/wiki/Change_the_default_PGDATA_directory_on_Windows" target="_blank" rel="external">this guide</a> (<em>note the data folder path value cannot have a trailing backslash, otherwise the PostgreSQL service will just fail</em>). After that the geocoding performance dropped considerably.</li>
<li>pgAdmin is easy to use. I used SQL Query and View Data (or view the top 100 rows if the table is huge) a lot. The explain analyze function in the SQL Query tool is also very intuitive.</li>
</ul>
<p>With the server and extensions installed, I need to load the Tiger data. The Tiger geocoder provides functions that generate scripts to download Tiger data from the Census ftp and set up the database. <a href="http://postgis.net/docs/Loader_Generate_Nation_Script.html" target="_blank" rel="external">The official documentation</a> didn’t provide enough information for me, so I had to search and tweak a lot. At first I tried the commands from the SQL query tool but it didn’t show any result. Later I solved this problem with hints from <a href="http://gis.stackexchange.com/questions/81907/install-postgis-and-tiger-data-in-ubuntu-12-04" target="_blank" rel="external">this guide</a>, although it was written for Ubuntu.</p>
<ul>
<li>You need to install the windows versions of 7z.exe and wget and record their paths.</li>
<li>Create a directory for the download. Postgresql needs to have permission on that folder. I just created the folder at the same level as the postgresql database folder, with both having the user group <code>Authenticated users</code> in full control. If you write a sql copy command to read a csv file in some other folder that doesn’t have this user permission, there can be a permission denied error.</li>
<li><p>Start pgAdmin, connect to the GIS database you created during installation, run the psql tool from pgAdmin, input <code>\a</code> and <code>\t</code> to set up the output format first, and set the output file by</p>
<pre><code>\o nation_generator.bat
</code></pre><p>then run </p>
<pre><code>SELECT loader_generate_nation_script('windows');
</code></pre><p>to generate the script that loads the national tables. It will be a file with the name specified by <code>\o nation_generator.bat</code> earlier, located in the same folder as <code>psql.exe</code>, which should be the postgresql bin folder.</p>
</li>
<li><p>Technically you can input the parameters specific to your system settings into the tables <code>loader_variables</code> and <code>loader_platform</code> under the <code>tiger</code> schema. However, after I input the parameters, only the stage folder (i.e. where to download the data to) was taken into the generated script. My guess is that file paths with spaces need to be properly escaped and quoted. The script generating function reads from the database then writes to a file, which means the file path goes through several different internal representations, making the escaping and quoting more complicated. I just replaced the default parameters with mine in the generated script later. <strong>Update</strong>: I found <a href="http://gis.stackexchange.com/questions/116803/installing-tiger-geocoder" target="_blank" rel="external">this answer</a> later. I probably should have used the <code>SET</code> command instead of directly editing the table columns. Anyway, replacing the settings afterwards still works, but you need to double check it.</p>
</li>
<li>All the parameters are listed in the first section of the generated script, and <code>cd your_stage_folder</code> is used several times throughout the script. You need to edit the parameters in the first section and make sure the stage folder is correct in all places.</li>
<li><p>After the national data is loaded by running the script, you can specify the states you want to load. The tiger database actually supports 56 states/regions. You can find them by </p>
<pre><code>select stusps, name from tiger.state order by stusps;
</code></pre></li>
<li><p>Start psql again, go through similar steps and run</p>
<pre><code>SELECT loader_generate_script(ARRAY['VA','MD'], 'windows');
</code></pre><p> Put the state abbreviations you want in the array. Note that if you copy the query results they will be quoted with double quotes by default, but you need single quotes in SQL. You can change the pgAdmin output setting in <code>Options - Query tool - Results grid</code>.</p>
</li>
<li><p>The generated script has one section for each state, and each section has parameters set at the beginning. You need to replace the parameters and the <code>cd your_stage_folder</code> with correct values. Using an editor that supports multi-line search and replace makes this much easier.</p>
</li>
<li>I don’t want to load 56 states in one script: if anything goes wrong it will be bothersome to start again from the last point. I wanted to split the big script into 56, one per state. I searched for a while and didn’t find software to do this, so I just wrote a python script.</li>
<li><p>First add a marker in the script to separate the states. I replaced all occurrences of</p>
<pre><code>set TMPDIR=e:\data\gisdata\temp\\
</code></pre><p>to </p>
<pre><code>:: ---- end state ----
set TMPDIR=e:\data\gisdata\temp\\
</code></pre><p>then deleted the <code>:: ---- end state ----</code> marker on the first line. This makes the marker appear at the end of each state section. Note that <code>::</code> is a commenting symbol in dos bat files, so it will not interfere with the script.</p>
<p> Then I ran this python script to split it by state.</p>
</li>