home_team_runs and away_team_runs returned NA #17

Closed
vegas31 opened this Issue Jun 10, 2014 · 3 comments

vegas31 commented Jun 10, 2014

I am using pitchRx's scrape() to look at which pitches a pitcher uses, given the score of the game. To do this, I am looking at the home_team_runs and away_team_runs columns in the GameDay data that scrape() provides. However, I am getting a lot of NAs in these columns even though the values are present when I search for 'home_team_runs' in the relevant XML file on gd2.mlb.com.

Here are my commands:
library(dplyr)
library(pitchRx)
june8 <- scrape(start = "2014-06-08", end = "2014-06-08")

I was mostly interested in the WAS/SDN game, which returned all NA for Jordan Zimmermann. Other games on June 8, and games on other days, give mostly the same result: only sporadic entries are filled in (see the attached screenshot, which shows the output of View(june8$atbat)).

I am on OS X 10.9.3, using pitchRx version 1.5 in RStudio version 0.98.501.

Happy to pass along any other info you need if I have forgotten anything -- many thanks!

Stuart

[Screenshot: screen shot 2014-06-10 at 1 13 26 pm]

Owner

cpsievert commented Jun 10, 2014

That is to be expected -- these values are missing (in the source files) unless runs are scored during the atbat.

I admit this is not the best data format. You probably want the running totals (without NAs).

Author

vegas31 commented Jun 10, 2014

Ahh, thanks for the clarification -- I was making a bad assumption about what those fields meant.

The commands you provided work for the most part. Looking at a subset of the data (WAS/SDN), the home values appear correct at first (here, 0 for the entire game), but then 1s start appearing after a certain point. I am looking into whether there is a particular reason for the change, but haven't found a pattern yet.

Thanks again for your help!

@cpsievert cpsievert closed this Jun 10, 2014

Owner

cpsievert commented Oct 20, 2014

Here is a method to convert home_team_runs/away_team_runs to numeric columns with the missing values filled in.

library(pitchRx)
june8 <- scrape(start = "2014-06-08", end = "2014-06-08")
atbats <- june8$atbat

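library(dplyr)
# make sure records are ordered by num (within game)
atbats <- split(atbats, atbats$gameday_link) %>%
            lapply(function(x) x[order(x$num), ]) %>%
            bind_rows()  # bind_rows() supersedes the deprecated rbind_all()
# replace missing values with the next non-missing value
f <- function(runs) {
  runs <- as.numeric(runs)
  idx <- which(!is.na(runs))
  # each recorded value covers itself and the run of NAs just before it
  filled <- rep(runs[idx], diff(c(0, idx)))
  # NAs after the last recorded value have nothing to copy; keep them NA so lengths match
  c(filled, rep(NA_real_, length(runs) - length(filled)))
}
# tapply() returns games in sorted gameday_link order, matching the split() above
atbats$home_team_runs <- unlist(with(atbats, tapply(home_team_runs, INDEX = gameday_link, f)))
atbats$away_team_runs <- unlist(with(atbats, tapply(away_team_runs, INDEX = gameday_link, f)))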
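
A quick way to sanity-check the fill logic without scraping anything is to run it on a toy vector. The sketch below is illustrative only: fill_backward and fill_forward are hypothetical helper names, and the sample values are made up. The forward-fill variant rests on an assumption not confirmed in this thread, namely that one might instead want each at-bat to carry the most recently recorded total, starting from 0.

```r
# Toy column: NA wherever no run scored during the at-bat (made-up values).
runs <- c(NA, NA, "1", NA, "3", NA)

# Backward fill: each NA takes the next non-missing value; trailing NAs stay NA.
fill_backward <- function(runs) {
  runs <- as.numeric(runs)
  idx <- which(!is.na(runs))
  filled <- rep(runs[idx], diff(c(0, idx)))
  c(filled, rep(NA_real_, length(runs) - length(filled)))
}

# Forward fill (assumption): carry the last recorded total forward, starting at 0.
fill_forward <- function(runs) {
  runs <- as.numeric(runs)
  seen <- cumsum(!is.na(runs))        # number of recorded values seen so far
  c(0, runs[!is.na(runs)])[seen + 1]  # position 1 holds the starting 0
}

backward <- fill_backward(runs)  # 1 1 1 3 3 NA
forward  <- fill_forward(runs)   # 0 0 1 1 3 3
```

Comparing the two outputs side by side shows how the choice of fill direction changes which at-bats are credited with a given score, which may be worth checking against a game whose scoring plays are known.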