\begin{article}
\title{Taming PITCHf/x Data with \pkg{XML2R} and \pkg{pitchRx}}
\author{by Carson Sievert}
\maketitle
\abstract{\pkg{XML2R} is a framework that reduces the effort required to transform
XML content into tables in a way that preserves parent-to-child relationships.
\pkg{pitchRx} applies \pkg{XML2R}'s grammar for XML manipulation
to Major League Baseball Advanced Media (MLBAM)'s Gameday data. With
\pkg{pitchRx}, one can easily obtain and store Gameday data in a
remote database. The Gameday website hosts a wealth of XML data, but
perhaps most interesting is PITCHf/x. Among other things, PITCHf/x
data can be used to recreate a baseball's flight path from a pitcher's
hand to home plate. With \pkg{pitchRx}, one can easily create animations
and interactive 3D scatterplots of the baseball's flight path. PITCHf/x
data is also commonly used to generate a static plot of baseball locations
at the moment they cross home plate. Such plots are sometimes called
\dfn{strike-zone plots}, a term that can also refer to a plot of event
probabilities over the same region. \pkg{pitchRx} provides an easy and robust
way to generate strike-zone plots using the \pkg{ggplot2} package.}
\section{Introduction}
\subsection{What is PITCHf/x?}
PITCHf/x is a general term for a system that generates a series of
3D measurements of a baseball's path from a pitcher's hand to home
plate \citep{patent}.%
\footnote{A \dfn{pitcher} throws a ball to the opposing \dfn{batter}, who
stands beside home plate and tries to hit the ball into the field
of play.%
} In an attempt to estimate the location of the ball at any time point,
a quadratic regression model with nine parameters (defined by the
equations of motion for constant linear acceleration) is fit to each
pitch. Studies with access to the actual measurements suggest that
this model is quite reasonable -- especially for non-knuckleball pitches
\citep{trajecoryAnalysis}. That is, the fitted path often provides
a reasonable estimate (within a couple of inches) of the actual locations.
Unfortunately, only the parameter estimates are made available to
the public. The website that provides these estimates is maintained
by MLBAM and hosts a wealth of other baseball-related data used to
inform MLB's Gameday webcast service in near real time.
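
To make the nine parameters concrete, the model can be written out as below.
This is a sketch of the standard equations of motion under constant
acceleration, not a formula quoted from the PITCHf/x documentation. The symbols
are assumptions chosen to mirror the Gameday field names (\code{x0},
\code{vx0}, \code{ax}, and so on), and $t = 0$ is assumed to be the moment
at which the ball is at $(x_0, y_0, z_0)$.
\[
\begin{array}{rcl}
x(t) & = & x_0 + v_{x0}\, t + \frac{1}{2} a_x t^2, \\[2pt]
y(t) & = & y_0 + v_{y0}\, t + \frac{1}{2} a_y t^2, \\[2pt]
z(t) & = & z_0 + v_{z0}\, t + \frac{1}{2} a_z t^2.
\end{array}
\]
Each coordinate contributes an initial position, an initial velocity, and an
acceleration, which gives the nine parameters estimated for every pitch.
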
\subsection{Why is PITCHf/x important?}
On the business side of baseball, using statistical analysis to scout
and evaluate players has become mainstream. When PITCHf/x was first
introduced, \citet{slate} proclaimed it
\begin{quote} ``The new technology that will change statistical analysis [of baseball] forever.'' \end{quote}
PITCHf/x has yet to fully deliver on this claim, partly because of the
difficulty in accessing and deriving insight from the large amount
of complex information. By providing better tools to collect and visualize
this data, \CRANpkg{pitchRx} makes PITCHf/x analysis more accessible
to the general public.
\subsection{PITCHf/x applications}
PITCHf/x data is, and can be, used for many different projects. It can
also complement other baseball data sources, which poses an interesting
database management problem. Statistical analysis of PITCHf/x data
and baseball in general has become so popular that it has helped expose
statistical ideas and practice to the general public. If you have
witnessed television broadcasts of MLB games, you know one obvious
application of PITCHf/x is locating pitches in the strike-zone as
well as recreating flight trajectories, tracking pitch speed, etc.
Some ongoing statistical research related to PITCHf/x includes classifying
pitch types, predicting pitch sequences, and clustering pitchers with
similar tendencies \citep{curve}.
\subsection[Contributions of pitchRx and XML2R]{Contributions of \pkg{pitchRx} and \pkg{XML2R}}
\pkg{pitchRx} has two main focuses \citep{pitchRx}. The first focus
is to provide easy access to Gameday data. Not only is \pkg{pitchRx}
helpful for collecting this data in bulk, but it has served as a helpful
teaching and research aid (\url{http://baseballwithr.wordpress.com/}
is one such example). Methods for collecting Gameday data existed
prior to \pkg{pitchRx}; however, these methods are not easily extensible
and require juggling technologies that may not be familiar or accessible
\citep{database}. Moreover, these working environments are less desirable
than R for data analysis and visualization. Since \pkg{pitchRx} is
built upon \CRANpkg{XML2R}'s unified framework, it can be easily modified
and/or extended \citep{XML2R}. For this same reason, \pkg{pitchRx}
serves as a model for building customized XML data collection tools
with \pkg{XML2R}.

The other main focus of \pkg{pitchRx} is to simplify the process of
creating popular PITCHf/x graphics. Strike-zone plots and animations
made via \pkg{pitchRx} utilize the extensible \CRANpkg{ggplot2}
framework as well as various customized options \citep{ggplot2}.
\pkg{ggplot2} is a convenient framework for generating strike-zone
plots primarily because of its facet schema, which allows one to make
visual comparisons across any combination of discrete variable(s).
Interactive 3D scatterplots are based on the \CRANpkg{rgl} package
and are useful for gaining a new perspective on flight trajectories \citep{rgl}.
\section{Getting familiar with Gameday}
Gameday data is hosted and made available for free thanks to MLBAM
via \url{http://gd2.mlb.com/components/game/mlb/}.%
\footnote{Please be respectful of this service and store any information after
you extract it instead of repeatedly querying the website. Before
using any content from this website, please also read the \href{http://gdx.mlb.com/components/copyright.txt}{copyright}.%
} From this website, one can obtain many different types of data besides
PITCHf/x. For example, one can obtain everything from \href{http://gd2.mlb.com/components/game/mlb/year_2013/month_07/day_16/gid_2013_07_16_aasmlb_nasmlb_1/media/instadium.xml}{structured media metadata}
to \href{http://gd2.mlb.com/components/game/mlb/twitter/anaInsiderTweets.xml}{insider tweets}.
In fact, this website's purpose is to serve data to various \url{http://mlb.com}
web pages and applications. As a result, some data is redundant and
the format may not be optimal for statistical analysis. For these
reasons, the \code{scrape} function is focused on retrieving data
that is useful for PITCHf/x analysis and providing it in a convenient
format for data analysis.

Navigating through the MLBAM website can be overwhelming, but it helps
to recognize that a homepage exists for nearly every day and every
game. For example, \url{http://gd2.mlb.com/components/game/mlb/year_2011/month_02/day_26/}
displays numerous hyperlinks to various files specific to February
26th, 2011. On this page is a hyperlink to \href{http://gd2.mlb.com/components/game/mlb/year_2011/month_02/day_26/miniscoreboard.xml}{miniscoreboard.xml}
which contains information on every game played on that date. This
page also has numerous hyperlinks to game specific pages. For example,
\href{http://gd2.mlb.com/components/game/mlb/year_2011/month_02/day_26/gid_2011_02_26_phimlb_nyamlb_1/}{gid\_2011\_02\_26\_phimlb\_nyamlb\_1/}
points to the homepage for that day's game between the NY Yankees
and Philadelphia Phillies. On this page is a hyperlink to the \href{http://gd2.mlb.com/components/game/mlb/year_2011/month_02/day_26/gid_2011_02_26_phimlb_nyamlb_1/players.xml}{players.xml}
file which contains information about the players, umpires, and coaches
(positions, names, batting average, etc.) coming into that game.
Starting from a particular game's homepage and clicking on the \href{http://gd2.mlb.com/components/game/mlb/year_2011/month_02/day_26/gid_2011_02_26_phimlb_nyamlb_1/inning/}{inning/}
directory, we \emph{should} see another page with links to the \href{http://gd2.mlb.com/components/game/mlb/year_2011/month_02/day_26/gid_2011_02_26_phimlb_nyamlb_1/inning/inning_all.xml}{inning\_all.xml}
file and the \href{http://gd2.mlb.com/components/game/mlb/year_2011/month_02/day_26/gid_2011_02_26_phimlb_nyamlb_1/inning/inning_hit.xml}{inning\_hit.xml}
file. If it is available, the \code{inning\_all.xml} file contains
the PITCHf/x data for that game. It's important to note that this
file won't exist for some games, because some games are played in
venues that do not have a working PITCHf/x system in place. This is
especially true for preseason games and games played prior to the
2008 season, when the PITCHf/x system became widely adopted.%
\footnote{In this case, \code{scrape} will print ``failed to load HTTP resource''
in the R console (after the relevant file name) to indicate that no
data was available.%
} The \code{inning\_hit.xml} files have manually recorded spatial
coordinates of where a home run landed or where the baseball made
initial contact with a defender after it was hit into play.

The relationship between these XML files and the tables returned by
\code{scrape} is presented in Table~\ref{table:pitchfx}. The \code{pitch}
table is extracted from files whose name ends in \code{inning\_all.xml}.
This is the only table returned by \code{scrape} that contains data
on the pitch-by-pitch level. The \code{atbat}, \code{po}, \code{runner},
and \code{action} tables from this same file, along with the \code{hip}
table from \code{inning\_hit.xml}, are commonly labeled, somewhat
ambiguously, as play-by-play data. The \code{player}, \code{coach},
and \code{umpire} tables are extracted from \code{players.xml} and
are classified as game-by-game since there is one record per person
per game. Figure~\ref{fig:relations} shows how these tables can
be connected to one another in a database setting. The direction of
the arrows represents a one-to-possibly-many relationship. For example,
at least one pitch is thrown for each \dfn{at bat} (that is, each
bout between pitcher and batter) and there are numerous at bats within
each game.

In a rough sense, one can relate tables returned by \code{scrape}
back to XML nodes in the XML files. For convenience, information
from certain XML nodes is combined into one table. For example, information
gleaned from the `top', `bottom', and `inning' XML nodes within \code{inning\_all.xml}
is included as \code{inning} and \code{inning\_side} fields in
the \code{pitch}, \code{po}, \code{atbat}, \code{runner}, and
\code{action} tables. This helps reduce the burden of merging many
tables together in order to have inning information on the play-by-play
and/or pitch-by-pitch level. Other information is ignored
because it is redundant. For example, the `game' node within the \code{players.xml}
file contains information that can be recovered from the \code{game}
table extracted from the \code{miniscoreboard.xml} file. If the reader
wants a more detailed explanation of fields in these tables, \citet{baseball}
provide a nice overview.
\begin{widetable}[ht]
\centering % used for centering table
\begin{tabular}{cccc}
\toprule
\textbf{\begin{tabular}[c]{@{}c@{}} Source file \\ suffix \end{tabular}} &
\textbf{\begin{tabular}[c]{@{}c@{}} Information \\ level \end{tabular}} &
\textbf{XML nodes} &
\textbf{\begin{tabular}[c]{@{}c@{}} Tables returned\\ by \code{scrape} \end{tabular}} \\
\midrule
\code{miniscoreboard.xml} & game-by-game & \begin{tabular}[c]{@{}c@{}} games, game, \\ game\_media, media \end{tabular} & game, media \\[12pt]
\midrule
\code{players.xml} & game-by-game & \begin{tabular}[c]{@{}c@{}} game, team, player, \\ coach, umpire \end{tabular} &
\begin{tabular}[c]{@{}c@{}} player, coach, \\ umpire \end{tabular} \\[12pt]
\midrule
\code{inning\_all.xml} & \begin{tabular}[c]{@{}c@{}} play-by-play, \\ pitch-by-pitch \end{tabular} &
\begin{tabular}[c]{@{}c@{}} game, inning, bottom, top, \\ atbat, po, pitch, runner, action \end{tabular} &
\begin{tabular}[c]{@{}c@{}} atbat, po, pitch, \\ runner, action \end{tabular} \\[18pt]
\midrule
\code{inning\_hit.xml} & play-by-play & hitchart, hip & hip \\
\bottomrule
\end{tabular}
\caption{Structure of PITCHf/x and related Gameday data sources accessible to \code{scrape}}
\label{table:pitchfx}
\end{widetable}
\begin{figure}
\centerline{\includegraphics[scale = 1.25]{Drawing1.pdf}}
\caption{Table relations between Gameday data accessible via \code{scrape}.
The direction of the arrows indicates a one-to-possibly-many relationship.
\label{fig:relations}}
\end{figure}
\section[Introducing XML2R]{Introducing \pkg{XML2R}}
\pkg{XML2R} adds to the \href{http://cran.r-project.org/web/views/WebTechnologies.html}{CRAN Task View on Web Technologies and Services}
by focusing on the transformation of XML content into a collection
of tables. Compared to a lower-level API like the \pkg{XML} package,
it can significantly reduce the amount of coding and cognitive effort
required to perform such a task. In contrast to most higher-level
APIs, it does not make assumptions about the XML structure or its
source. Although \pkg{XML2R} works on any structure, performance
and user experience are enhanced if the content has an inherent relational
structure. \pkg{XML2R}'s novel approach to XML data collection breaks
down the transformation process into a few simple steps and allows
the user to decide how to apply those steps.

The next few sections demonstrate how \pkg{pitchRx} leverages \pkg{XML2R}
in order to produce a collection of tables from \code{inning\_all.xml}
files. A similar approach is used by \code{pitchRx::scrape} to construct
tables from the other Gameday files in Table~\ref{table:pitchfx}.
In fact, \pkg{XML2R} has also been used in the R package \href{https://github.com/cpsievert/bbscrapeR}{bbscrapeR}
which collects data from \href{http://nba.com}{nba.com} and \href{http://wnba.com}{wnba.com}.
\subsection{Constructing file names}
Sometimes the most frustrating part of obtaining data in bulk from
the web is finding the proper collection of URLs or file names
of interest. Since files on the Gameday website are fairly well organized,
the \code{makeUrls} function can be used to construct \code{urls}
that point to every game's homepage within a window of dates.
%
\begin{Schunk}
\begin{Sinput}
urls <- makeUrls(start = "2011-06-01", end = "2011-06-01")
sub("http://gd2.mlb.com/components/game/mlb/", "", head(urls))
\end{Sinput}
\begin{Soutput}
#> [1] "year_2011/month_06/day_01/gid_2011_06_01_anamlb_kcamlb_1"
#> [2] "year_2011/month_06/day_01/gid_2011_06_01_balmlb_seamlb_1"
#> [3] "year_2011/month_06/day_01/gid_2011_06_01_chamlb_bosmlb_1"
#> [4] "year_2011/month_06/day_01/gid_2011_06_01_clemlb_tormlb_1"
#> [5] "year_2011/month_06/day_01/gid_2011_06_01_colmlb_lanmlb_1"
#> [6] "year_2011/month_06/day_01/gid_2011_06_01_flomlb_arimlb_1"
\end{Soutput}
\end{Schunk}
%
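
The directory layout described earlier makes these URLs easy to assemble by
hand. The following is a minimal base R sketch of that idea, not the
implementation of \code{makeUrls}. The helper names \code{day\_url} and
\code{game\_url} are hypothetical, and the sketch assumes that the date can be
read from characters 5 to 14 of a game identifier.
%
\begin{Schunk}
\begin{Sinput}
# Hypothetical helpers (not part of pitchRx) that mirror the site's layout.
root <- "http://gd2.mlb.com/components/game/mlb/"

# Page that lists every game played on a given date.
day_url <- function(date) {
  paste0(root, format(as.Date(date), "year_%Y/month_%m/day_%d/"))
}

# Homepage of a single game; the date is embedded in the game identifier.
game_url <- function(gid) {
  date <- as.Date(substr(gid, 5, 14), format = "%Y_%m_%d")
  paste0(day_url(date), gid, "/")
}

day_url("2011-02-26")
game_url("gid_2011_02_26_phimlb_nyamlb_1")
\end{Sinput}
\end{Schunk}
%
The second call rebuilds the address of the game homepage used as an example
in the previous section.
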
\subsection{Extracting observations}
Once we have a collection of XML \code{files}, the next step is to
parse the content into a list of \dfn{observations}. An observation
is technically defined as a matrix with one row and some number of
columns. The columns are defined by XML attributes and the XML value
(if any) for a particular XML lineage. The name of each observation
tracks the XML hierarchy so observations can be grouped together in
a sensible fashion at a later point.
%
\begin{Schunk}
\begin{Sinput}
library(XML2R)
files <- paste0(urls, "/inning/inning_all.xml")
obs <- XML2Obs(files, url.map = TRUE, quiet = TRUE)
\end{Sinput}
\end{Schunk}
%
\vspace{-.45cm}
%
\begin{Schunk}
\begin{Sinput}
table(names(obs))
\end{Sinput}
\begin{Soutput}
#>
#> game game//inning
#> 2 18
#> game//inning//bottom//action game//inning//bottom//atbat
#> 13 69
#> game//inning//bottom//atbat//pitch game//inning//bottom//atbat//po
#> 247 4
#> game//inning//bottom//atbat//runner game//inning//top//action
#> 50 20
#> game//inning//top//atbat game//inning//top//atbat//pitch
#> 79 278
#> game//inning//top//atbat//po game//inning//top//atbat//runner
#> 17 89
#> url_map
#> 1
\end{Soutput}
\end{Schunk}
%
This output tells us that 247 pitches were thrown in the bottom half of
innings and 278 in the top half on June 1st, 2011. Also, there are 12
different levels of observations. The list element named \code{url\_map}
is not considered an observation and was included since \code{url.map = TRUE}.
This helps avoid repeating long file names in the \code{url\_key}
column, which tracks the mapping between observations and file names.
%
\begin{Schunk}
\begin{Sinput}
obs[1]
\end{Sinput}
\begin{Soutput}
#> $`game//inning//top//atbat//pitch`
#> des id type tfs tfs_zulu x y
#> [1,] "Called Strike" "3" "S" "191018" "2011-06-01T23:10:18Z" "109.87" "145.06"
#> sv_id start_speed end_speed sz_top sz_bot pfx_x pfx_z px
#> [1,] "110601_191020" "87.9" "82.1" "3.6" "1.65" "-6.7" "4.36" "-0.213"
#> pz x0 y0 z0 vx0 vy0 vz0 ax
#> [1,] "2.611" "-1.612" "50.0" "5.633" "5.808" "-128.728" "-2.903" "-11.406"
#> ay az break_y break_angle break_length pitch_type
#> [1,] "22.954" "-24.681" "23.9" "22.5" "6.5" "SI"
#> type_confidence zone nasty spin_dir spin_rate cc mt url_key
#> [1,] ".798" "5" "39" "236.697" "1538.041" "" "" "url1"
\end{Soutput}
\end{Schunk}
%
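
The fields \code{x0} through \code{az} in the observation above are the nine
fitted parameters. As a rough check of how they are used, the sketch below
plugs the printed values into the constant-acceleration equations (position
equals initial position plus velocity times $t$ plus half the acceleration
times $t^2$) and evaluates the path where the ball reaches the front of home
plate. The plate coordinate of 1.417 feet, the assumption that all quantities
are in feet and seconds, and the helper name \code{position} are assumptions
of this sketch rather than facts stated in the Gameday files.
%
\begin{Schunk}
\begin{Sinput}
# Fitted parameters copied from the observation printed above.
p <- c(x0 = -1.612, y0 = 50.0, z0 = 5.633,
       vx0 = 5.808, vy0 = -128.728, vz0 = -2.903,
       ax = -11.406, ay = 22.954, az = -24.681)

# Position at time t under constant acceleration (hypothetical helper).
position <- function(t, p) {
  c(x = p[["x0"]] + p[["vx0"]] * t + 0.5 * p[["ax"]] * t^2,
    y = p[["y0"]] + p[["vy0"]] * t + 0.5 * p[["ay"]] * t^2,
    z = p[["z0"]] + p[["vz0"]] * t + 0.5 * p[["az"]] * t^2)
}

# Assumed y coordinate (in feet) of the front edge of home plate.
y_plate <- 1.417

# Earliest time at which y(t) equals y_plate (smaller root of the quadratic).
disc <- p[["vy0"]]^2 - 2 * p[["ay"]] * (p[["y0"]] - y_plate)
t_plate <- (-p[["vy0"]] - sqrt(disc)) / p[["ay"]]

round(position(t_plate, p), 3)
\end{Sinput}
\end{Schunk}
%
Under these assumptions the \code{x} and \code{z} components come out close to
the \code{px} and \code{pz} fields of the same observation ($-0.213$ and
$2.611$), which suggests that those two fields record the location of the
pitch as it crosses the plate.
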
\subsection{Renaming observations}
Before grouping observations into a collection of tables based on their
names, one may want to \code{re\_name} observations. Observations
with names \code{'game//inning//bottom//atbat'} and \code{'game//inning//top//atbat'}
should be included in the same table since they share XML attributes
(in other words, the observations share variables).
%
\begin{Schunk}
\begin{Sinput}
tmp <- re_name(obs, equiv = c("game//inning//top//atbat",
"game//inning//bottom//atbat"), diff.name = "inning_side")
\end{Sinput}
\end{Schunk}
%
By passing these names to the \code{equiv} argument, \code{re\_name}
determines the difference in the naming scheme and suppresses that
difference. In other words, observation names that match the \code{equiv}
argument will be renamed to \code{'game//inning//atbat'}. The information
removed from the name is not lost, however, as a new column is appended
to the end of each relevant observation. For example, notice how the
\code{inning\_side} column contains the part of the name we just
removed:
%
\begin{Schunk}
\begin{Sinput}
tmp[grep("game//inning//atbat", names(tmp))][1:2]
\end{Sinput}
\begin{Soutput}
#> $`game//inning//atbat`
#> num b s o start_tfs start_tfs_zulu batter stand b_height
#> [1,] "1" "0" "1" "0" "190935" "2011-06-01T23:09:35Z" "430001" "R" "5-10"
#> pitcher p_throws
#> [1,] "502190" "R"
#> des event score
#> [1,] "Rickie Weeks homers (10) on a fly ball to left field. " "Home Run" "T"
#> home_team_runs away_team_runs url_key inning_side
#> [1,] "0" "1" "url1" "top"
#>
#> $`game//inning//atbat`
#> num b s o start_tfs start_tfs_zulu batter stand b_height
#> [1,] "2" "0" "0" "0" "191105" "2011-06-01T23:11:05Z" "460579" "L" "6-0"
#> pitcher p_throws
#> [1,] "502190" "R"
#> des
#> [1,] "Nyjer Morgan triples (3) on a line drive to right fielder Jay Bruce. "
#> event url_key inning_side
#> [1,] "Triple" "url1" "top"
\end{Soutput}
\end{Schunk}
%
For the same reason, other pairs of observation names are renamed in a
similar fashion.
%
\begin{Schunk}
\begin{Sinput}
tmp <- re_name(tmp, equiv = c("game//inning//top//atbat//runner",
"game//inning//bottom//atbat//runner"), diff.name = "inning_side")
tmp <- re_name(tmp, equiv = c("game//inning//top//action",
"game//inning//bottom//action"), diff.name = "inning_side")
tmp <- re_name(tmp, equiv = c("game//inning//top//atbat//po",
"game//inning//bottom//atbat//po"), diff.name = "inning_side")
obs2 <- re_name(tmp, equiv = c("game//inning//top//atbat//pitch",
"game//inning//bottom//atbat//pitch"), diff.name = "inning_side")
table(names(obs2))
\end{Sinput}
\begin{Soutput}
#>
#> game game//inning
#> 2 18
#> game//inning//action game//inning//atbat
#> 33 148
#> game//inning//atbat//pitch game//inning//atbat//po
#> 525 21
#> game//inning//atbat//runner url_map
#> 139 1
\end{Soutput}
\end{Schunk}
%
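
The effect of \code{re\_name} on the names themselves can be mimicked with
regular expressions. This is a base R sketch of the idea only, not how
\code{re\_name} is implemented, and the variable names are illustrative.
%
\begin{Schunk}
\begin{Sinput}
# Two observation names that differ only in the inning side.
nms <- c("game//inning//top//atbat", "game//inning//bottom//atbat")

# The part that differs becomes the value of a new column ...
inning_side <- sub("^game//inning//(top|bottom)//atbat$", "\\1", nms)

# ... and is dropped from the name so that both names coincide.
new_nms <- sub("//(top|bottom)//", "//", nms)

inning_side
new_nms
\end{Sinput}
\end{Schunk}
%
Both entries of \code{new\_nms} equal \code{'game//inning//atbat'}, while
\code{inning\_side} keeps the information that was removed from the names.
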
\subsection{Linking observations}
After all that renaming, we now have 7
different levels of observations. Let's examine the first three observations
on the \code{game//inning} level:
%
\begin{Schunk}
\begin{Sinput}
obs2[grep("^game//inning$", names(obs2))][1:3]
\end{Sinput}
\begin{Soutput}
#> $`game//inning`
#> num away_team home_team next url_key
#> [1,] "1" "mil" "cin" "Y" "url1"
#>
#> $`game//inning`
#> num away_team home_team next url_key
#> [1,] "2" "mil" "cin" "Y" "url1"
#>
#> $`game//inning`
#> num away_team home_team next url_key
#> [1,] "3" "mil" "cin" "Y" "url1"
\end{Soutput}
\end{Schunk}
%
Before grouping observations into tables, it is usually important
to preserve the parent-to-child relationships in the XML lineage. For
example, one may want to map a particular pitch back to the inning
in which it was thrown. Using the \code{add\_key} function, the relevant
value of \code{num} for \code{game//inning} observations can be
\code{recycle}d to its XML descendants.
%
\begin{Schunk}
\begin{Sinput}
obswkey <- add_key(obs2, parent = "game//inning", recycle = "num", key.name = "inning")
\end{Sinput}
\begin{Soutput}
#> A key for the following children will be generated for the game//inning node:
#> game//inning//atbat//pitch
#> game//inning//atbat//runner
#> game//inning//atbat
#> game//inning//atbat//po
#> game//inning//action
\end{Soutput}
\end{Schunk}
%
As it turns out, the \code{away\_team} and \code{home\_team} columns
are redundant as this information is embedded in the \code{url} column.
Thus, there is only one other informative attribute on this level,
namely \code{next}. By recycling this value among its descendants,
we remove any need to retain a \code{game//inning} table.
%
\begin{Schunk}
\begin{Sinput}
obswkey <- add_key(obswkey, parent = "game//inning", recycle = "next")
\end{Sinput}
\begin{Soutput}
#> A key for the following children will be generated for the game//inning node:
#> game//inning//atbat//pitch
#> game//inning//atbat//runner
#> game//inning//atbat
#> game//inning//atbat//po
#> game//inning//action
\end{Soutput}
\end{Schunk}
%
It is also imperative that we can link a \code{pitch}, \code{runner},
or \code{po} back to a particular \code{atbat}. This can be done
as follows:
%
\begin{Schunk}
\begin{Sinput}
obswkey <- add_key(obswkey, parent = "game//inning//atbat", recycle = "num")
\end{Sinput}
\begin{Soutput}
#> A key for the following children will be generated for the game//inning//atbat node:
#> game//inning//atbat//pitch
#> game//inning//atbat//runner
#> game//inning//atbat//po
\end{Soutput}
\end{Schunk}
%
\subsection{Collapsing observations}
Finally, we are in a position to pool together observations that have
a common name. The \code{collapse\_obs} function achieves this by
row binding observations with the same name together and returning
a list of matrices. Note that \code{collapse\_obs} does not require
observations from the same level to have the same set of variables
in order to be bound into a common table. In the case where variables
are missing, \code{NA}s will be inserted as values.
%
\begin{Schunk}
\begin{Sinput}
tables <- collapse_obs(obswkey)
# As mentioned before, we do not need the 'inning' table
tables <- tables[!grepl("^game//inning$", names(tables))]
table.names <- c("game", "action", "atbat", "pitch", "po", "runner")
tables <- setNames(tables, table.names)
head(tables[["runner"]])
\end{Sinput}
\begin{Soutput}
#> id start end event score rbi earned url_key inning_side
#> [1,] "430001" "" "" "Home Run" "T" "T" "T" "url1" "top"
#> [2,] "460579" "" "3B" "Triple" NA NA NA "url1" "top"
#> [3,] "460579" "3B" "" "Groundout" "T" "T" "T" "url1" "top"
#> [4,] "425902" "" "1B" "Single" NA NA NA "url1" "top"
#> [5,] "425902" "1B" "" "Pop Out" NA NA NA "url1" "top"
#> [6,] "458015" "" "1B" "Single" NA NA NA "url1" "bottom"
#> inning next num
#> [1,] "1" "Y" "1"
#> [2,] "1" "Y" "2"
#> [3,] "1" "Y" "3"
#> [4,] "1" "Y" "4"
#> [5,] "1" "Y" "6"
#> [6,] "1" "Y" "9"
\end{Soutput}
\end{Schunk}
%
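
The padding behaviour described above can be illustrated with two hand-made
observations. The sketch below uses base R only and is not the implementation
of \code{collapse\_obs}. The helper \code{pad} is hypothetical, and the two
observations are abbreviated versions of the first two \code{runner} rows
shown in the output.
%
\begin{Schunk}
\begin{Sinput}
# Two toy observations (one-row character matrices) with different columns.
o1 <- matrix(c("430001", "Home Run", "T"), nrow = 1,
             dimnames = list(NULL, c("id", "event", "score")))
o2 <- matrix(c("460579", "Triple"), nrow = 1,
             dimnames = list(NULL, c("id", "event")))
obs_toy <- list(o1, o2)

# Union of the column names, in order of first appearance.
cols <- unique(unlist(lapply(obs_toy, colnames)))

# Pad an observation with NA for every column it lacks (hypothetical helper).
pad <- function(o) {
  out <- matrix(NA_character_, nrow = 1, ncol = length(cols),
                dimnames = list(NULL, cols))
  out[, colnames(o)] <- o
  out
}

# Row bind the padded observations into a single table.
tab <- do.call(rbind, lapply(obs_toy, pad))
tab
\end{Sinput}
\end{Schunk}
%
The second row of \code{tab} has \code{NA} in the \code{score} column because
the second observation did not carry that attribute.
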
\section[Collecting Gameday data with pitchRx]{Collecting Gameday data with \pkg{pitchRx}}
The main scraping function in \pkg{pitchRx}, \code{scrape}, can
be used to easily obtain data from the files listed in Table~\ref{table:pitchfx}.
In fact, any combination of these files can be queried using the \code{suffix}
argument. In the example below, the \code{start} and \code{end}
arguments are also used so that all available file types for June
1st, 2011 are queried.
%
\begin{Schunk}
\begin{Sinput}
library(pitchRx)
files <- c("inning/inning_all.xml", "inning/inning_hit.xml",
"miniscoreboard.xml", "players.xml")
dat <- scrape(start = "2011-06-01", end = "2011-06-01", suffix = files)
\end{Sinput}
\end{Schunk}
%
The \code{game.ids} option can be used instead of \code{start} and
\code{end} to obtain an equivalent \code{dat} object. This option
can be useful if the user wants to query specific games rather than
all games played over a particular time span. When using the \code{game.ids}
option, the built-in \code{gids} object is quite convenient.
%
\begin{Schunk}
\begin{Sinput}
data(gids, package = "pitchRx")
gids11 <- gids[grep("2011_06_01", gids)]
head(gids11)
\end{Sinput}
\begin{Soutput}
#> [1] "gid_2011_06_01_anamlb_kcamlb_1" "gid_2011_06_01_balmlb_seamlb_1"
#> [3] "gid_2011_06_01_chamlb_bosmlb_1" "gid_2011_06_01_clemlb_tormlb_1"
#> [5] "gid_2011_06_01_colmlb_lanmlb_1" "gid_2011_06_01_flomlb_arimlb_1"
\end{Soutput}
\end{Schunk}
%
%
\begin{Schunk}
\begin{Sinput}
dat <- scrape(game.ids = gids11, suffix = files)
\end{Sinput}
\end{Schunk}
%
The object \code{dat} is a list of data frames containing all data
available for June 1st, 2011 using \code{scrape}. The list names
match the table names provided in Table~\ref{table:pitchfx}. For
example, \code{dat\$atbat} is a data frame with every at bat on June
1st, 2011 and \code{dat\$pitch} has information related to the outcome
of each pitch (including PITCHf/x parameters). The \code{object.size}
of \code{dat} is nearly 300MB. Multiplying this number by 100 days
gives roughly 30GB, which exceeds the memory of most machines. Thus,
if a large amount of data
is required, the user should exploit the R database interface \citep{DBI}.
\section{Storing and querying Gameday data}
Since PITCHf/x data can easily exhaust memory, one should consider
establishing a database instance before using \code{scrape}. By passing
a database connection to the \code{connect} argument, \code{scrape}
will try to create (and/or append to existing) tables using that connection.
If the connection fails for some reason, tables will be written as
CSV files in the current working directory. The benefits of using
the \code{connect} argument include improved memory management, which
can greatly reduce run time. \code{connect} will support a MySQL
connection, but creating a SQLite database is quite easy with \CRANpkg{dplyr}
\citep{dplyr}.
%
\begin{Schunk}
\begin{Sinput}
library(dplyr)
db <- src_sqlite("GamedayDB.sqlite3", create = TRUE)
# Collect and store all PITCHf/x data from 2008 to now
scrape(start = "2008-01-01", end = Sys.Date(),
suffix = "inning/inning_all.xml", connect = db$con)
\end{Sinput}
\end{Schunk}
%
In later sections, animations of four-seam and cut fastballs thrown
by Mariano Rivera and Phil Hughes during the 2011 season are created.
In order to obtain the data for those animations, one could query
\code{db}, which now has PITCHf/x data from 2008 to date. This query
requires criteria on the \code{pitcher\_name} field (in the \code{atbat}
table), the \code{pitch\_type} field (in the \code{pitch} table),
and the \code{date} field (in both tables). To reduce the time required
to search those records, one should create an index on each of these
three fields.
%
\begin{Schunk}
\begin{Sinput}
library(DBI)
dbSendQuery(db$con, "CREATE INDEX pitcher_index ON atbat(pitcher_name)")
dbSendQuery(db$con, "CREATE INDEX type_index ON pitch(pitch_type)")
dbSendQuery(db$con, "CREATE INDEX date_atbat ON atbat(date)")
\end{Sinput}
\end{Schunk}
%
As a part of our query, we'll have to join the \code{atbat} table
together with the \code{pitch} table. For this task, the \code{gameday\_link}
and \code{num} fields are helpful since together they provide a way
to match pitches with at bats. For this reason, a multi-column index
on the \code{gameday\_link} and \code{num} fields will further reduce
run time of the query.
%
\begin{Schunk}
\begin{Sinput}
dbSendQuery(db$con, 'CREATE INDEX pitch_join ON pitch(gameday_link, num)')
dbSendQuery(db$con, 'CREATE INDEX atbat_join ON atbat(gameday_link, num)')
\end{Sinput}
\end{Schunk}
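%
One way to check that an index is actually used is SQLite's \code{EXPLAIN QUERY PLAN}.
The following is a minimal sketch on a toy in-memory table (not the
Gameday database), assuming the \pkg{RSQLite} package is installed;
the table contents are made up for illustration.
%
\begin{Schunk}
\begin{Sinput}
library(DBI)
# Toy in-memory database standing in for GamedayDB.sqlite3
con <- dbConnect(RSQLite::SQLite(), ":memory:")
toy <- data.frame(pitch_type = c("FF", "FC", "CU", "FF"),
  num = 1:4, stringsAsFactors = FALSE)
dbWriteTable(con, "pitch", toy)
dbExecute(con, "CREATE INDEX type_index ON pitch(pitch_type)")
# Ask SQLite how it would execute a filter on the indexed column
plan <- dbGetQuery(con,
  "EXPLAIN QUERY PLAN SELECT * FROM pitch WHERE pitch_type = 'FF'")
print(plan$detail)
dbDisconnect(con)
\end{Sinput}
\end{Schunk}
%
If the index is picked up, the reported plan should mention \code{type\_index}
rather than a full table scan.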
%
Although the query itself could be expressed entirely in SQL, \pkg{dplyr}'s
grammar for data manipulation (which is database agnostic) can help
to simplify the task. In this case, \code{at.bat} is a tabular \emph{representation}
of the remote \code{atbat} table restricted to cases where Rivera
or Hughes was the pitcher. That is, \code{at.bat} does not contain
the actual data, but it does contain the information necessary to
retrieve it from the database.
%
\begin{Schunk}
\begin{Sinput}
at.bat <- tbl(db, "atbat") %>%
filter(pitcher_name %in% c("Mariano Rivera", "Phil Hughes"))
\end{Sinput}
\end{Schunk}
%
Similarly, \code{fbs} is a tabular representation of the \code{pitch}
table restricted to four-seam (FF) and cut fastballs (FC).
%
\begin{Schunk}
\begin{Sinput}
fbs <- tbl(db, "pitch") %>%
filter(pitch_type %in% c("FF", "FC"))
\end{Sinput}
\end{Schunk}
%
An \code{inner\_join} of these two filtered tables returns a tabular
representation of all four-seam and cut fastballs thrown by Rivera
and Hughes. Before \code{collect} actually performs the database
query and brings the relevant data into the R session, another restriction
is added so that only pitches from 2011 are included.
%
\begin{Schunk}
\begin{Sinput}
pitches <- inner_join(fbs, at.bat) %>%
filter(date >= "2011_01_01" & date <= "2012_01_01") %>%
collect()
\end{Sinput}
\end{Schunk}
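%
The lazy behaviour described above can be seen on a small scale. The
sketch below uses a toy in-memory SQLite database with made-up rows
(assuming \pkg{RSQLite} and \pkg{dbplyr} are installed) and joins
explicitly on \code{gameday\_link} and \code{num}. Here \code{show\_query}
prints the SQL that \pkg{dplyr} generates, and nothing is retrieved
until \code{collect} is called.
%
\begin{Schunk}
\begin{Sinput}
library(dplyr)
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
# Made-up rows for illustration only
toy_atbat <- data.frame(
  gameday_link = c("gid_a", "gid_a", "gid_b"),
  num = c(1L, 2L, 1L),
  pitcher_name = c("Mariano Rivera", "Phil Hughes", "Someone Else"),
  stringsAsFactors = FALSE)
toy_pitch <- data.frame(
  gameday_link = c("gid_a", "gid_a", "gid_a", "gid_b"),
  num = c(1L, 1L, 2L, 1L),
  pitch_type = c("FC", "FF", "CU", "FF"),
  stringsAsFactors = FALSE)
DBI::dbWriteTable(con, "atbat", toy_atbat)
DBI::dbWriteTable(con, "pitch", toy_pitch)
# Lazy representations: no rows are fetched yet
toy_ab <- tbl(con, "atbat") %>%
  filter(pitcher_name %in% c("Mariano Rivera", "Phil Hughes"))
toy_fbs <- tbl(con, "pitch") %>%
  filter(pitch_type %in% c("FF", "FC"))
toy_q <- inner_join(toy_fbs, toy_ab, by = c("gameday_link", "num"))
show_query(toy_q)          # inspect the generated SQL
toy_res <- collect(toy_q)  # run the query and fetch the rows
DBI::dbDisconnect(con)
\end{Sinput}
\end{Schunk}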
%
\section{Visualizing PITCHf/x}
\subsection{Strike-zone plots and umpire bias}
Amongst the most common PITCHf/x graphics are strike-zone plots. Such
a plot has two axes and the coordinates represent the location of
baseballs as they cross home plate. The term strike-zone plot can
refer to either \emph{density} or \emph{probabilistic} plots. Density
plots are useful for exploring what \emph{actually} occurred, but
probabilistic plots can help address much more interesting questions
using statistical inference. Although probabilistic plots can be used
to visually track any event probability across the strike-zone, their
most popular use is for addressing umpire bias in a strike versus
ball decision \citep{bias}. The probabilistic plots section demonstrates
how \pkg{pitchRx} simplifies the process behind creating such plots
via a case study on the impact of home field advantage on umpire decisions.
In the world of sports, it is a common belief that umpires (or referees)
have a tendency to favor the home team. PITCHf/x provides a unique
opportunity to add to this discussion by modeling the probability
of a called strike at home games versus away games. Specifically,
conditioned upon the umpire making a decision at a specific location
in the strike-zone, if the probability that a home pitcher receives
a called strike is higher than the probability that an away pitcher
receives a called strike, then there is evidence to support umpire
bias towards a home pitcher.
There are many different possible outcomes of each pitch, but we can
condition on the umpire making a decision by limiting to the following
two cases. A \dfn{called strike} is an outcome of a pitch where the
batter does not swing and the umpire declares the pitch a strike (which
is a favorable outcome for the pitcher). A \dfn{ball} is another
outcome where the batter does not swing and the umpire declares the
pitch a ball (which is a favorable outcome for the batter). All \code{decisions}
made between 2008 and 2013 can be obtained from \code{db} with the
following query using \pkg{dplyr}.
%
\begin{Schunk}
\begin{Sinput}
# First, add an index on the pitch description to speed up run-time
dbSendQuery(db$con, "CREATE INDEX des_index ON pitch(des)")
pitch <- tbl(db, "pitch") %>%
filter(des %in% c("Called Strike", "Ball")) %>%
# Keep pitch location, descriptions
select(px, pz, des, gameday_link, num) %>%
# 0-1 indicator of strike/ball
mutate(strike = as.numeric(des == "Called Strike"))
atbat <- tbl(db, "atbat") %>%
# Select variables to be used later as covariates in probabilistic models
select(b_height, p_throws, stand, inning_side, date, gameday_link, num)
decisions <- inner_join(pitch, atbat) %>%
filter(date <= "2014_01_01") %>%
collect()
\end{Sinput}
\end{Schunk}
%
\subsubsection{Density plots}
The \code{decisions} data frame contains data on over 2.5 million
pitches thrown from 2008 to 2013. About a third of them are called
strikes and two-thirds balls. Figure~\ref{fig:STRIKES} shows the
density of all called strikes. Clearly, most called strikes occur
on the outer region of the strike-zone. Many factors could contribute
to this phenomenon, which we will not investigate here.
%
\begin{Schunk}
\begin{Sinput}
# strikeFX uses the stand variable to calculate strike-zones
# Here is a slick way to create better facet titles without changing data values
relabel <- function(variable, value) {
value <- sub("^R$", "Right-Handed Batter", value)
sub("^L$", "Left-Handed Batter", value)
}
strikes <- subset(decisions, strike == 1)
strikeFX(strikes, geom = "raster", layer = facet_grid(. ~ stand, labeller = relabel))
\end{Sinput}
\end{Schunk}
%
\begin{figure}[h]
\centerline{\includegraphics[width=0.95\textwidth]{strikes.png}}
\caption{\label{fig:STRIKES} Density of called strikes for right-handed batters
and left-handed batters (from 2008 to 2013).}
\end{figure}
Figure~\ref{fig:STRIKES} shows one static rectangle (or strike-zone)
per plot automatically generated by \code{strikeFX}. The definition
of the strike-zone is notoriously ambiguous. As a result, the boundaries
of the strike-zone may be noticeably different in some situations.
However, we can achieve a fairly accurate representation of strike-zones
using a rectangle defined by batters' average height and stance \citep{Strikezones}.
As Figure~\ref{fig:strike-probs} reinforces, batter stance makes
an important difference since the strike-zone seems to be horizontally
shifted away from the batter. The batter's height is also important
since the strike-zone is classically defined as approximately between
the batter's knees and armpits.
Figure~\ref{fig:STRIKES} has one strike-zone per plot since the
\code{layer} option contains a \pkg{ggplot2} argument that facets
according to batter stance. Facet layers are a powerful tool for analyzing
PITCHf/x data because they help produce quick and insightful comparisons.
In addition to using the \code{layer} option, one can add layers
to a graphic returned by \code{strikeFX} using \pkg{ggplot2} arithmetic.
It is also worth pointing out that Figure~\ref{fig:STRIKES} could
have been created without introducing the \code{strikes} data frame
by using the \code{density1} and \code{density2} options.
%
\begin{Schunk}
\begin{Sinput}
strikeFX(decisions, geom = "raster", density1 = list(des = "Called Strike"),
density2 = list(des = "Called Strike")) + facet_grid(. ~ stand, labeller = relabel)
\end{Sinput}
\end{Schunk}
%
In general, when \code{density1} and \code{density2} are identical,
the result is equivalent to subsetting the data frame appropriately
beforehand. More importantly, by specifying \emph{different} values
for \code{density1} and \code{density2}, differenced densities are
easily generated. In this case, a grid of density estimates for \code{density2}
is subtracted from the corresponding grid of density estimates for
\code{density1}. Note that the default \code{NULL} value for either
density option implies that the entire data set defines the relevant
density. Thus, if \code{density2} were \code{NULL} (when \code{density1 = list(des = 'Called Strike')}),
we would obtain the density of called strikes minus the density of
\emph{both} called strikes and balls. In Figure~\ref{fig:strikesVSballs},
we define \code{density1} as called strikes and define \code{density2}
as balls. As expected, we see positive density values (in blue) inside
the strike-zone and negative density values (in red) outside of the
strike-zone.
%
\begin{Schunk}
\begin{Sinput}
strikeFX(decisions, geom = "raster", density1 = list(des = "Called Strike"),
density2 = list(des = "Ball"), layer = facet_grid(. ~ stand, labeller = relabel))
\end{Sinput}
\end{Schunk}
%
\begin{figure}[h]
\centerline{\includegraphics[width=0.95\textwidth]{strikesVSballs.pdf}}
\caption{\label{fig:strikesVSballs} Density of called strikes minus density
of balls for both right-handed batters and left-handed batters (from
2008 to 2013). The blue region indicates a higher frequency of called
strikes and the red region indicates a higher frequency of balls.}
\end{figure}
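The differencing idea can be illustrated outside of \code{strikeFX}.
The following is a minimal sketch on simulated locations, using \code{kde2d}
from the \pkg{MASS} package as a stand-in density estimator (the estimator
used internally by \code{strikeFX} is not assumed here). The key point
is that both densities are evaluated on the same grid before subtracting.
%
\begin{Schunk}
\begin{Sinput}
library(MASS)
set.seed(1)
# Simulated data: "strikes" cluster tightly, "balls" are more spread out
sim_strikes <- data.frame(px = rnorm(500, 0, 0.5),
  pz = rnorm(500, 2.5, 0.5))
sim_balls <- data.frame(px = rnorm(500, 0, 1.2),
  pz = rnorm(500, 2.5, 1.2))
# A common grid (x limits, then y limits) makes the estimates comparable
lims <- c(-2.5, 2.5, 0, 5)
d1 <- kde2d(sim_strikes$px, sim_strikes$pz, n = 50, lims = lims)
d2 <- kde2d(sim_balls$px, sim_balls$pz, n = 50, lims = lims)
# Differenced density: positive where strikes are relatively more common
dens_diff <- d1$z - d2$z
\end{Sinput}
\end{Schunk}
%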
These density plots are helpful for visualizing the observed frequency
of events; however, they are not very useful for addressing our umpire
bias hypothesis. Instead of looking simply at the \emph{density},
we want to model the \emph{probability} of a strike called at each
coordinate given the umpire has to make a decision.
\subsubsection{Probabilistic plots}
There are many approaches to probabilistic modeling over a two dimensional
spatial region. Since our response is often categorical, generalized
additive models (GAMs) are a popular and desirable approach to modeling
events over the strike-zone \citep{loess}. There are numerous R package
implementations of GAMs, but the \code{bam} function from the \CRANpkg{mgcv}
package has several desirable properties \citep{mgcv}. Most importantly,
the smoothing parameter can be estimated using several different methods.
In order to have a reasonable estimate of the smooth 2D surface, GAMs
require a fairly large number of observations. As a result, run time
can be an issue, especially when modeling 2.5 million observations!
Thankfully, the \code{bam} function has a \code{cluster} argument
which allows one to distribute computations across multiple cores
using the built-in \pkg{parallel} package.
%
\begin{Schunk}
\begin{Sinput}
library(parallel)
cl <- makeCluster(detectCores() - 1)
library(mgcv)
m <- bam(strike ~ interaction(stand, p_throws, inning_side) +
s(px, pz, by = interaction(stand, p_throws, inning_side)),
data = decisions, family = binomial(link = 'logit'), cluster = cl)
\end{Sinput}
\end{Schunk}
%
This formula models the probability of a strike as a function of the
baseball's spatial location, the batter's stance, the pitcher's throwing
arm, and the side of the inning. Since home pitchers always pitch
during the top of the inning, \code{inning\_side} also serves as
an indication of whether a pitch is thrown by a home pitcher. In this
case, the \code{interaction} function creates a factor with eight
different levels since every input factor has two levels. Consequently,
there are 8 different levels of smooth surfaces over the spatial region
defined by \code{px} and \code{pz}.
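The model specification can also be written out explicitly. As a sketch,
with notation introduced here only for illustration, let $g = 1, \dots, 8$
index the levels of \code{interaction(stand, p\_throws, inning\_side)}
and let $(x, z)$ denote the location given by \code{px} and \code{pz}.
For a pitch in group $g$,
\[
\log \frac{p_g(x, z)}{1 - p_g(x, z)} = \beta_g + f_g(x, z),
\]
where $p_g(x, z)$ is the probability of a called strike, $\beta_g$
is the group-specific level contributed by the parametric \code{interaction}
term, and $f_g$ is the smooth surface for group $g$ contributed by
the \code{s(px, pz, by = ...)} term.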
The fitted model \code{m} contains a lot of information which \code{strikeFX}
uses in conjunction with any \pkg{ggplot2} facet commands to infer
which surfaces should be plotted and how. In particular, the \code{var.summary}
element is used to identify model covariates, as well as their default conditioning
values. In our case, the majority of \code{decisions} are from right-handed
pitchers and the top of the inning. Thus, the default conditioning
values are \code{"top"} for \code{inning\_side} and \code{"R"}
for \code{p\_throws}. If different conditioning values are desired,
\code{var.summary} can be modified accordingly. To demonstrate, Figure~\ref{fig:strike-probs}
shows 2 of the 8 possible surfaces that correspond to a right-handed
\emph{away} pitcher.
%
\begin{Schunk}
\begin{Sinput}
away <- list(inning_side = factor("bottom", levels = c("top", "bottom")))
m$var.summary <- modifyList(m$var.summary, away)
strikeFX(decisions, model = m, layer = facet_grid(. ~ stand, labeller = relabel))
\end{Sinput}
\end{Schunk}
%
\begin{figure}[h]
\centerline{\includegraphics[width=0.95\textwidth]{prob-strike.pdf}}
\caption{\label{fig:strike-probs}Probability that a right-handed away pitcher
receives a called strike (provided the umpire has to make a decision).
Plots are faceted by the handedness of the batter.}
\end{figure}
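The effect of \code{modifyList} on the conditioning values can be seen
with a small stand-in list. This is only a sketch: \code{vs} below
is a hypothetical simplification of \code{m\$var.summary}, which holds
an entry for every model covariate.
%
\begin{Schunk}
\begin{Sinput}
# Hypothetical stand-in for m$var.summary
vs <- list(stand = factor("R", levels = c("L", "R")),
  inning_side = factor("top", levels = c("top", "bottom")))
toy_away <- list(
  inning_side = factor("bottom", levels = c("top", "bottom")))
vs2 <- modifyList(vs, toy_away)
# Only inning_side is replaced; stand keeps its original value
as.character(vs2$inning_side)
as.character(vs2$stand)
\end{Sinput}
\end{Schunk}
%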
Using the same intuition exploited earlier to obtain differenced density
plots, we can easily obtain differenced probability plots. To obtain
Figure~\ref{fig:diff-probs}, we simply add \code{p\_throws} as
another facet variable and \code{inning\_side} as a differencing
variable. In this case, conditioning values do not matter since every
one of the 8 surfaces is required in order to produce Figure~\ref{fig:diff-probs}.
%
\begin{Schunk}
\begin{Sinput}
# Function to create better labels for both stand and p_throws
relabel2 <- function(variable, value) {
if (variable %in% "stand")
return(sub("^L$", "Left-Handed Batter",
sub("^R$", "Right-Handed Batter", value)))
if (variable %in% "p_throws")
return(sub("^L$", "Left-Handed Pitcher",
sub("^R$", "Right-Handed Pitcher", value)))
}
strikeFX(decisions, model = m, layer = facet_grid(p_throws ~ stand, labeller = relabel2),
density1 = list(inning_side = "top"), density2 = list(inning_side = "bottom"))
\end{Sinput}
\end{Schunk}
%
\begin{figure}[h]
\centerline{\includegraphics[scale = 1]{prob-diff.pdf}}
\caption{\label{fig:diff-probs}Difference between home and away pitchers in
the probability of a strike (provided the umpire has to make a decision).
The blue regions indicate a higher probability of a strike for home
pitchers and red regions indicate a higher probability of a strike
for away pitchers. Plots are faceted by the handedness of both the
pitcher and the batter.}
\end{figure}
The four different plots in Figure~\ref{fig:diff-probs} represent
the four different combinations of values among \code{p\_throws} and
\code{stand}. In general, provided that a pitcher throws to a batter
in the blue region, the pitch is more likely to be called a strike
if the pitcher is on their home turf. Interestingly, there is a well-defined
blue elliptical band around the boundaries of the typical strike-zone.
Thus, home pitchers are more likely to receive a favorable call --
especially when the classification of the pitch is in question. In
some areas, the home pitcher has up to a 6 percent higher probability
of receiving a called strike than an away pitcher. The subtle differences
in spatial patterns across the different values of \code{p\_throws}
and \code{stand} are interesting as well. For instance, pitching
at home has a large positive impact for a left-handed pitcher throwing
in the lower inside portion of the strike-zone to a right-handed batter,
but the impact seems negligible in the mirror opposite case.
Differenced probabilistic densities are clearly an interesting visual
tool for analyzing PITCHf/x data. With \code{strikeFX}, one can quickly
and easily make all sorts of visual comparisons for various situations.
In fact, one can explore and compare the probabilistic structure of
any well-defined event over a strike-zone region (for example, the
probability a batter reaches base) using a similar approach.
\subsection{2D animation}
\code{animateFX} provides convenient and flexible functionality for
animating the trajectory of any desired set of pitches. For demonstration
purposes, this section animates every four-seam and cut fastball thrown
by Mariano Rivera and Phil Hughes during the 2011 season. These pitches
provide a good example of how facets play an important role in extracting
new insights. Similar methods can be used to analyze any MLB player
(or combination of players) in greater detail.
\code{animateFX} tracks three dimensional pitch locations over a
sequence of two dimensional plots. The animation takes on the viewpoint
of the umpire; that is, each time the plot refreshes, the balls are
getting closer to the viewer. This is reflected with the increase
in size of the points as the animation progresses. Obviously, some
pitches travel faster than others, which explains the different sizes
within a particular frame. Animations revert to the initial point
of release once \emph{all} of the baseballs have reached home plate.
During an interactive session, \code{animateFX} produces a series
of plots that may not be viewed easily. One option available to the user
is to wrap \code{animation::saveHTML} around \code{animateFX} to
view the animation in a browser with proper looping controls \citep{animation}.
To reduce the time and thinking required to produce these animations,
\code{animateFX} has default settings for the geometry, color, opacity
and size associated with each plot. Any of these assumptions can be
altered, except for the point geometry. In order for animations to
work, a data frame with the appropriately named PITCHf/x parameters
(that is, x0, y0, z0, vx0, vy0, vz0, ax0, ay0 and az0) is required.
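To give a sense of how these nine parameters determine a trajectory,
the sketch below computes pitch locations under a constant-acceleration
model along each axis. This is an assumption made for illustration,
not a reproduction of the internals of \code{animateFX}; the function
\code{pitch\_position} and the parameter values are hypothetical.
%
\begin{Schunk}
\begin{Sinput}
# Location t seconds after the initial point, assuming constant
# acceleration along each axis
pitch_position <- function(p, t) {
  data.frame(t = t,
    x = p$x0 + p$vx0 * t + 0.5 * p$ax0 * t^2,
    y = p$y0 + p$vy0 * t + 0.5 * p$ay0 * t^2,
    z = p$z0 + p$vz0 * t + 0.5 * p$az0 * t^2)
}
# Made-up values for a single pitch, for illustration only
p <- list(x0 = -2, y0 = 50, z0 = 6,
  vx0 = 6, vy0 = -130, vz0 = -5,
  ax0 = -10, ay0 = 30, az0 = -20)
pos <- pitch_position(p, seq(0, 0.4, by = 0.1))
pos
\end{Sinput}
\end{Schunk}
%
Each row of \code{pos} corresponds to one frame of a hypothetical animation,
with \code{y} (the distance from home plate) shrinking as the pitch
approaches the viewer.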