Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Inconsistencies in Parquet download files #73

Open
JRice15 opened this issue Nov 15, 2023 · 6 comments
Open

Inconsistencies in Parquet download files #73

JRice15 opened this issue Nov 15, 2023 · 6 comments
Labels
fix on upcoming release Bug that is expected to be resolved in conjunction with Retrosheet's next release (2x per year)

Comments

@JRice15
Copy link

JRice15 commented Nov 15, 2023

Describe the bug
I downloaded the parquet files directly from the OneDrive link included in the repo's readme, and have been reading them with pyarrow and pandas. In digging though the data (2022 only, so far), I discovered two problems, one of which led me to the other.

  1. First (and maybe more of a feature request than a bug) is the fact that, best I can tell, there is no easy way to tell whether games are regular season or postseason/allstar/other from any of the parquet files (I would particularly expect game.parquet, gamelog.parquet, or schedule.parquet to have an indicator column for this, but I do not see one).

  2. Second appears to be more of a bug. Contrary to no. 1 above, schedule.parquet only seems to include regular season games. Great, we can simply filter game.parquet and other files by whether or not the game exists in schedule.parquet, right? Nope, schedule.parquet seems to list games that never actually occurred. As an example: schedule.parquet includes a game MIL @ CHN 2022-04-08, not part of a double header. However, games.parquet (as well as baseball-reference and other sources) tell us that no such game exists! MIL @ CHN games occurred on 4/7/22 and 4/9/22, but not 4/8/22. I find 88 of these 'phantom games' in schedule.parquet for 2022. And it's not a byproduct of the pandemic or lockout, I found these inconsistencies in every year I've looked at as far back as 2000.

To Reproduce
Steps to reproduce the behavior:

  1. Compare schedule.parquet and game.parquet as described above.

Expected behavior
Games will be consistent across files, and a column easily delineates whether games are regular season or not.

I am entirely open to the fact that the files are exactly as intended and I am just missing something that explains the discrepancies, let me know if that is the case. Thanks!

@droher
Copy link
Owner

droher commented Nov 15, 2023

  1. See this issue here - Update data to the Summer 2023 Release #68. Game type was not included in the raw files until this summer, and there's no released version of the game parsing software that creates a column out of it. Hopefully I can manage to get it into the game file on the next update. I made a custom CSV that I linked in that issue to hold you over in the meantime. (https://github.com/droher/boxball/files/11984000/game_types.csv)

  2. It sounds like this was an issue for you primarily as a function of 1), but in case you're still interested in using them, the explanation is here: https://www.retrosheet.org/schedule/index.html Those are the original schedules before postponements and rescheduling. A valuable historical dataset to have on its own, but not for this particular purpose.

@JRice15
Copy link
Author

JRice15 commented Nov 15, 2023

  1. Thanks, that csv will work for my purposes. I really appreciate the project!

  2. Maybe a simple README file could be included with the data files that makes clear what each file is representing, so others don't run into the same confusion. The fact that 'postponement_indicator' and 'makeup_dates' columns exist in schedule.parquet and are all-null as far as I can tell made me assume that the problem was not caused of rescheduling, but maybe those columns are incorrect or incomplete.

@droher
Copy link
Owner

droher commented Nov 15, 2023

The readme has a link to the schema doc, perhaps not prominently enough, where you can find the schedule doc. I just pulled the schedule.parquet file down from OneDrive and see plenty of non-null values in those columns, but there is a 2020 bug.

The error is coming from the extract step - 2020 is the only year with two different schedule files, one for the original and one for the revised, and the glob is picking up both. I'll fix that in the code and the data should be correct in the next release. Thanks for the catch! If this is the only 2020-related bug in the data I'll be pleasantly surprised.

@JRice15
Copy link
Author

JRice15 commented Nov 15, 2023

You're right, I amend my statement to: 'postponement_indicator' is all-null in 2021 and 2022 only. When grouping the entire schedules table by year, I see zero postponements for 2021 and 2022, while every other year has at least 20 or so.

@droher
Copy link
Owner

droher commented Nov 15, 2023

Yep, that's definitely a bug - this time from gaps in the data that have since been corrected by Retrosheet. Those years will be filled on the next release (which I'm guessing I'll have up here in December).

@JRice15
Copy link
Author

JRice15 commented Nov 15, 2023

Great, thanks for your diligence and quick responses!

@droher droher added the fix on upcoming release Bug that is expected to be resolved in conjunction with Retrosheet's next release (2x per year) label Nov 17, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
fix on upcoming release Bug that is expected to be resolved in conjunction with Retrosheet's next release (2x per year)
Projects
None yet
Development

No branches or pull requests

2 participants