Historical minor league baseball boxscores

Prepared and maintained by Chadwick Baseball Bureau (http://www.chadwick-bureau.com)
Contact: Dr T L Turocy (firstname.lastname@example.org)

ABOUT THIS DATA

This package contains transcriptions of historical minor league boxscores. Please read the description below carefully to be sure you understand what these data are (and what they aren't).

COPYRIGHT AND LICENSE

These files are copyright by Chadwick Baseball Bureau. They are licensed under the Creative Commons Attribution 4.0 International license:
https://creativecommons.org/licenses/by/4.0/

The source code to transform the original transcriptions into standardised formats (found in the src/ directory) is copyright by T L Turocy and Chadwick Baseball Bureau. It is licensed under the GNU General Public Licence, version 2.0 (or later, at the user's discretion):
https://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html

DETAILS

Historically, the published averages for many minor leagues in baseball omitted players who appeared in only a handful of games ("less-thans"); other leagues never published complete averages at all. In these cases, the only way to document the participation of these players is by capturing a published boxscore. In addition, there are other reasons why having a compilation of game-level data for some historical minor leagues may be of interest.

We have developed a simple, text-based format for capturing boxscores efficiently. This format is similar to the structure of a typical newspaper boxscore, so the data can be transcribed with only a minimum of markup required of the inputter.

These files are organised in the transcript/ directory. Each source is included in a separate subdirectory. For example, a source might consist of all boxscores found in a particular newspaper in a particular year.

There is a parser, in src/convert.py, which takes all of the boxscore transcriptions from a source and processes the data into CSV files, which are placed into a corresponding directory under processed/. This process does not add or interpolate any new information, but simply extracts and interprets the information found in the original transcriptions. The resulting CSV files are then suitable for further editorial processing.

The objective of these files is to render the content of those sources in a way that is as faithful as possible to the originals. It is important to recognise that THE GUIDES AND OTHER SOURCES CONTAIN ERRORS AND INACCURACIES. This collection of files does not attempt to identify and/or propose corrections to those errors. The scope of this collection is to document the contents of sources in a standard and systematic way, and therefore to provide the inputs required by editors who wish to produce cleaned, corrected, or improved accounts of the performance data for these leagues. The files in this collection therefore provide one essential component in the chain of evidence required to produce such improved data.

PEOPLE NAMES TABLES

For each source, a table called people.csv is built. This summarises the number of appearances for each name on each club, including first and last observed dates, and games by position. This table is grouped by name; therefore, if a player appears under more than one spelling of his name (which is not at all uncommon), his performance will be split across multiple rows. Again, making judgments about proper names and identifications is a task that is carried out downstream from this dataset.

Each row is given a person.ref identifier. These are eight-character strings of the format LLLLNNTT. The first four characters are the (double) metaphone encoding of the last name of the person, padded out to four characters if necessary by adding 'Z' (as 'Z' is not a letter that is used in metaphone).
The digits TT are the total count of names with the same metaphone encoding, among names observed in that player's league. NN is a sequence number, which can range from 01 up to TT. The sequence is generated by sorting (lexicographically) on last name, first name, and then club name.

For example, suppose there are four separate entries with the surname Smith, differing by club and/or first name/initial. The metaphone encoding of Smith is SM0, which is padded to SM0Z. The four entries would then have the person.ref values of SM0Z0104, SM0Z0204, SM0Z0304, and SM0Z0404. The order in which they are assigned is determined by the sorting of their first name and club name.

This is a deterministic way to assign these identifiers, and therefore the same identifier will always be assigned to the same performance, if the dataset is not changed. Also, if new boxscores are added with new names, this will only affect the person.ref assignments to names with the same metaphone encoding.

Picking up the Smith example, suppose a new boxscore is added, and there is one new player, named Baker. Because Baker has the metaphone encoding of PKR, this will not affect the person.ref values given to the Smiths. However, if the new boxscore had a new person named Schmitt, which also has metaphone encoding SM0, the person.ref of the Smith entries would all change. Schmitt would become SM0Z0105 (because there are now 5 rows with SM0, and Schmitt sorts before Smith). Then the four Smiths would be SM0Z0205 through SM0Z0505.

The effect of this scheme is to make it possible to collect boxscores incrementally. On the one hand, it should be possible to refer to a row in a stable way. However, as a new spelling of a name comes into the dataset, it may sometimes be the case that it will cause a revision downstream of the identification of the player.
It could be, for example, that one of those players listed as Smith really is Schmitt, and the existence of the boxscore with Schmitt leads the researcher to revise the identification. The use of metaphone means these possible reassignments will get flagged only for similar-sounding names; the use of the total count in the identifier ensures identifiers will not get re-used.
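
The person.ref scheme described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code in src/convert.py: the function name assign_person_refs, the first names and club names are hypothetical, all entries are assumed to come from a single league, and the metaphone encodings are the ones quoted in this README rather than computed by a double metaphone implementation.

```python
from collections import defaultdict

# Encodings quoted from the example in this README. A real run would compute
# them with a double metaphone implementation, which is not included here.
EXAMPLE_CODES = {"Smith": "SM0", "Schmitt": "SM0", "Baker": "PKR"}


def assign_person_refs(entries, codes=EXAMPLE_CODES):
    """Map each (last name, first name, club) tuple to an LLLLNNTT string.

    Hypothetical helper: entries are assumed unique and from one league.
    """
    groups = defaultdict(list)
    for entry in entries:
        last_name = entry[0]
        # LLLL: the encoding, padded to four characters with 'Z'.
        prefix = codes[last_name][:4].ljust(4, "Z")
        groups[prefix].append(entry)

    refs = {}
    for prefix, members in groups.items():
        total = len(members)  # TT: how many names share this encoding
        # Tuples sort lexicographically on last name, first name, club.
        for seq, entry in enumerate(sorted(members), start=1):
            refs[entry] = f"{prefix}{seq:02d}{total:02d}"  # NN then TT
    return refs


# Hypothetical first initials and clubs, for illustration only.
smiths = [
    ("Smith", "A", "Albany"),
    ("Smith", "J", "Albany"),
    ("Smith", "J", "Troy"),
    ("Smith", "W", "Utica"),
]

before = assign_person_refs(smiths)
after = assign_person_refs(smiths + [("Schmitt", "H", "Troy")])

print(before[("Smith", "A", "Albany")])  # SM0Z0104
print(after[("Schmitt", "H", "Troy")])   # SM0Z0105
print(after[("Smith", "A", "Albany")])   # SM0Z0205
```

Because TT is part of the identifier, adding Schmitt changes every identifier in the SM0 group, so none of the earlier four values is handed to a different row.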