home_team_runs and away_team_runs returned NA #17

Closed
vegas31 opened this issue Jun 10, 2014 · 3 comments

vegas31 commented Jun 10, 2014

I am using pitchRx and its scrape() function to look at which pitches a pitcher uses, given the score of the game. To do this, I am looking at the home_team_runs and away_team_runs columns in the GameDay data that scrape() returns. However, I am seeing a lot of NAs in my data even though the values are present when I search for 'home_team_runs' in the relevant XML file on gd2.mlb.com.

Here are my commands:
library(dplyr)
library(pitchRx)
june8 <- scrape(start = "2014-06-08", end = "2014-06-08")

I was mostly interested in the WAS/SDN game, which returned all NA for Jordan Zimmermann. Other games on June 8, and games on other days, give mostly the same result: only sporadic entries are filled in (see the attached screenshot, which shows the output of View(june8$atbat)).

I am on OS X 10.9.3, using pitchRx version 1.5 in RStudio version 0.98.501.

Happy to pass along any other info you need if I have forgotten anything -- many thanks!

Stuart

[Attached screenshot: output of View(june8$atbat), 2014-06-10]

@cpsievert
Owner

That is to be expected -- these values are missing (in the source files) unless runs are scored during the at-bat.

I admit this is not the best data format. You probably want the running totals (without NAs).

@vegas31
Author

vegas31 commented Jun 10, 2014

Ahh, thanks for the clarification -- I was making a bad assumption about what those fields meant.

The commands you provided work for the most part. Looking at a subset of the data (WAS/SDN), the home values appear to be correct (here, 0 for the entire game), but the other column starts showing 1's after a certain point. I am trying to work out why it changes over, but haven't found a pattern yet.

Thanks again for your help!

@cpsievert
Owner

Here is a method to convert home_team_runs/away_team_runs to numeric running totals, with the NAs filled in.

library(pitchRx)
june8 <- scrape(start = "2014-06-08", end = "2014-06-08")
atbats <- june8$atbat

library(dplyr)
# make sure records are ordered by num (within game)
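atbats <- split(atbats, atbats$gameday_link) %>%
            lapply(function(x) x[order(as.numeric(x$num)), ]) %>%
            bind_rows()  # rbind_all() in older dplyr releases
# replace each missing value with the next non-missing value (within game)
f <- function(runs) {
  runs <- as.numeric(runs)
  idx <- which(!is.na(runs))
  filled <- rep(runs[idx], diff(c(0, idx)))
  # keep any trailing NAs so the output length matches the input
  c(filled, rep(NA_real_, length(runs) - length(filled)))
}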
atbats$home_team_runs <- unlist(with(atbats, tapply(home_team_runs, INDEX = gameday_link, f)))
atbats$away_team_runs <- unlist(with(atbats, tapply(away_team_runs, INDEX = gameday_link, f)))
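
To see what the fill does, here is a small sketch on a made-up vector (the input, the helper names fill_next/fill_prev, and the start = 0 default are illustrative assumptions, not part of pitchRx). fill_next follows the same idea as f above, padding any trailing NAs: each NA takes the next recorded value, so a run appears on at-bats before the one where it scored. That may be why 1's seem to show up early. If the recorded value is the score after the at-bat (an assumption worth checking against the XML), carrying the previous value forward may be closer to the score during each at-bat.

```r
# Hypothetical toy input: a value is recorded only on at-bats where a run scored
runs <- c(NA, NA, "1", NA, "2", NA)

# Backward fill: each NA takes the NEXT non-missing value
fill_next <- function(runs) {
  runs <- as.numeric(runs)
  idx <- which(!is.na(runs))
  filled <- rep(runs[idx], diff(c(0, idx)))
  # pad trailing NAs so the output has the same length as the input
  c(filled, rep(NA_real_, length(runs) - length(filled)))
}

# Forward fill: each NA takes the PREVIOUS non-missing value,
# or `start` (assumed 0) before the first recorded value
fill_prev <- function(runs, start = 0) {
  runs <- as.numeric(runs)
  known <- !is.na(runs)
  c(start, runs[known])[cumsum(known) + 1]
}

fill_next(runs)  # 1 1 1 2 2 NA
fill_prev(runs)  # 0 0 1 1 2 2
```

Either helper can be dropped in for f in the tapply() calls above, as long as the rows are sorted by gameday_link and num first.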
