<b>*Disclaimer - this section is a work in progress</b>
<br></br>
This page walks through the data sources and collection methods used in this project, along with brief descriptions, images, and tables of the datasets mentioned. The data must be acquired using at least one Python API and one R API. The project draws on several data formats, which may include labeled data, qualitative data, text data, geospatial data, and record data.

# Baseballr

"baseballr" is an R package for baseball analytics, also known as sabermetrics. It includes functions for scraping data from websites such as FanGraphs.com, Baseball-Reference.com, and BaseballSavant.mlb.com, as well as functions for calculating baseball metrics such as wOBA (weighted on-base average) and FIP (fielding independent pitching). I will mainly use this package to gather data, which it does through an API, as the source code below shows.
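
To make the FIP metric mentioned above concrete, here is a minimal Python sketch of the standard formula, FIP = (13·HR + 3·(BB + HBP) − 2·K) / IP + constant. This is not the baseballr implementation: the function name `fip` and the default constant of 3.10 are assumptions for illustration, since the league constant changes from season to season.

```python
def fip(hr, bb, hbp, k, ip, constant=3.10):
    """Fielding Independent Pitching from raw counting stats.

    `constant` is an assumed placeholder; the real league constant
    varies by season and should be looked up for the year in question.
    `ip` is innings pitched as a true decimal (one out = 1/3 inning).
    """
    if ip <= 0:
        raise ValueError("innings pitched must be positive")
    return (13 * hr + 3 * (bb + hbp) - 2 * k) / ip + constant


# Hypothetical season line, not real player data
season_fip = fip(hr=20, bb=50, hbp=5, k=200, ip=180.0)
print(round(season_fip, 2))
```

Only home runs, walks, hit batters, and strikeouts enter the formula, which is why the metric is described as independent of fielding.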

### Source Code

The source code below was pulled from the baseballr GitHub repository. It uses the MLB Stats API to acquire play-by-play data for a specific game. I will use these functions later on through the baseballr package.

In [2]:
library(tidyverse)

-- Attaching core tidyverse packages ------------------------ tidyverse 2.0.0 --
v dplyr     1.1.2     v readr     2.1.4
v forcats   1.0.0     v stringr   1.5.0
v ggplot2   3.4.2     v tibble    3.2.1
v lubridate 1.9.2     v tidyr     1.3.0
v purrr     1.0.1     
-- Conflicts ------------------------------------------ tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()
i Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors


In [3]:
#| code-fold: true

mlb_api_call <- function(url){
  res <-
    httr::RETRY("GET", url)
  
  json <- res$content %>%
    rawToChar() %>%
    jsonlite::fromJSON(simplifyVector = T)
  
  return(json)
}

mlb_stats_endpoint <- function(endpoint){
  all_endpoints = c(
    "v1/attendance",#
    "v1/conferences",#
    "v1/conferences/{conferenceId}",#
    "v1/awards/{awardId}/recipients",#
    "v1/awards",#
    "v1/baseballStats",#
    "v1/eventTypes",#
    "v1/fielderDetailTypes",#
    "v1/gameStatus",#
    "v1/gameTypes",#
    "v1/highLow/types",#
    "v1/hitTrajectories",#
    "v1/jobTypes",#
    "v1/languages",
    "v1/leagueLeaderTypes",#
    "v1/logicalEvents",#
    "v1/metrics",#
    "v1/pitchCodes",#
    "v1/pitchTypes",#
    "v1/playerStatusCodes",#
    "v1/positions",#
    "v1/reviewReasons",#
    "v1/rosterTypes",#
    "v1/runnerDetailTypes",#
    "v1/scheduleEventTypes",#
    "v1/situationCodes",#
    "v1/sky",#
    "v1/standingsTypes",#
    "v1/statGroups",#
    "v1/statTypes",#
    "v1/windDirection",#
    "v1/divisions",#
    "v1/draft/{year}",#
    "v1/draft/prospects/{year}",#
    "v1/draft/{year}/latest",#
    "v1.1/game/{gamePk}/feed/live",
    "v1.1/game/{gamePk}/feed/live/diffPatch",#
    "v1.1/game/{gamePk}/feed/live/timestamps",#
    "v1/game/changes",##x
    "v1/game/analytics/game",##x
    "v1/game/analytics/guids",##x
    "v1/game/{gamePk}/guids",##x
    "v1/game/{gamePk}/{GUID}/analytics",##x
    "v1/game/{gamePk}/{GUID}/contextMetricsAverages",##x
    "v1/game/{gamePk}/contextMetrics",#
    "v1/game/{gamePk}/winProbability",#
    "v1/game/{gamePk}/boxscore",#
    "v1/game/{gamePk}/content",#
    "v1/game/{gamePk}/feed/color",##x
    "v1/game/{gamePk}/feed/color/diffPatch",##x
    "v1/game/{gamePk}/feed/color/timestamps",##x
    "v1/game/{gamePk}/linescore",#
    "v1/game/{gamePk}/playByPlay",#
    "v1/gamePace",#
    "v1/highLow/{orgType}",#
    "v1/homeRunDerby/{gamePk}",#
    "v1/homeRunDerby/{gamePk}/bracket",#
    "v1/homeRunDerby/{gamePk}/pool",#
    "v1/league",#
    "v1/league/{leagueId}/allStarBallot",#
    "v1/league/{leagueId}/allStarWriteIns",#
    "v1/league/{leagueId}/allStarFinalVote",#
    "v1/people",#
    "v1/people/freeAgents",#
    "v1/people/{personId}",##U
    "v1/people/{personId}/stats/game/{gamePk}",#
    "v1/people/{personId}/stats/game/current",#
    "v1/jobs",#
    "v1/jobs/umpires",#
    "v1/jobs/datacasters",#
    "v1/jobs/officialScorers",#
    "v1/jobs/umpires/games/{umpireId}",##x
    "v1/schedule/",#
    "v1/schedule/games/tied",#
    "v1/schedule/postseason",#
    "v1/schedule/postseason/series",#
    "v1/schedule/postseason/tuneIn",##x
    "v1/seasons",#
    "v1/seasons/all",#
    "v1/seasons/{seasonId}",#
    "v1/sports",#
    "v1/sports/{sportId}",#
    "v1/sports/{sportId}/players",#
    "v1/standings",#
    "v1/stats",#
    "v1/stats/metrics",##x
    "v1/stats/leaders",#
    "v1/stats/streaks",##404
    "v1/teams",#
    "v1/teams/history",#
    "v1/teams/stats",#
    "v1/teams/stats/leaders",#
    "v1/teams/affiliates",#
    "v1/teams/{teamId}",#
    "v1/teams/{teamId}/stats",#
    "v1/teams/{teamId}/affiliates",#
    "v1/teams/{teamId}/alumni",#
    "v1/teams/{teamId}/coaches",#
    "v1/teams/{teamId}/personnel",#
    "v1/teams/{teamId}/leaders",#
    "v1/teams/{teamId}/roster",##x
    "v1/teams/{teamId}/roster/{rosterType}",#
    "v1/venues"#
  )
  base_url = glue::glue('http://statsapi.mlb.com/api/{endpoint}')
  return(base_url)
}



In [31]:
x <- "http://statsapi.mlb.com/api/v1/game/575156/playByPlay"

output <- mlb_api_call(x)

"output" is a long, deeply nested list that is hard to read when printed. Instead of printing it, the three images below show parts of the list.

![Figure 1](./images/baseballr_example1.png)
![Figure 2](./images/baseballr_example2.png)
![Figure 3](./images/baseballr_example3.png)
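
When a parsed JSON response is too long to print, one option is to summarize its shape instead. The sketch below is a Python illustration of that idea, not part of baseballr: the helper name `summarize` and the toy dictionary are assumptions, and the toy keys only imitate the kind of nesting seen in the images above.

```python
def summarize(obj, max_depth=2, depth=0, name="root"):
    """Return indented lines describing the type and size of nested elements."""
    indent = "  " * depth
    if isinstance(obj, dict):
        lines = [f"{indent}{name}: dict with {len(obj)} keys"]
        if depth < max_depth:
            for key, value in obj.items():
                lines.extend(summarize(value, max_depth, depth + 1, str(key)))
        return lines
    if isinstance(obj, list):
        lines = [f"{indent}{name}: list of {len(obj)} items"]
        # Only the first item is inspected, assuming items share one shape
        if obj and depth < max_depth:
            lines.extend(summarize(obj[0], max_depth, depth + 1, "[0]"))
        return lines
    return [f"{indent}{name}: {type(obj).__name__}"]


# Toy stand-in for a parsed response (hypothetical values)
toy_output = {
    "copyright": "placeholder",
    "allPlays": [
        {"result": {"event": "Strikeout"}, "playEvents": [{"isPitch": True}]},
    ],
}
print("\n".join(summarize(toy_output)))
```

Raising `max_depth` shows more levels at the cost of a longer printout.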

The code below builds on the previous code, returning a tibble with over 100 columns of pitch-level data provided by the MLB Stats API. As you can see, the output is much cleaner and easier to work with.
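
The core idea of turning nested at-bat records into one row per pitch can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not a port of the R function: it attaches at-bat fields to each event directly rather than joining on end times, and the names `flatten`, `pitch_rows`, and the toy data are hypothetical.

```python
def flatten(record, prefix=""):
    """Flatten nested dicts into one level with dot-separated keys."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


def pitch_rows(all_plays):
    """Build one flat row per play event, carrying at-bat fields along."""
    rows = []
    for at_bat in all_plays:
        # At-bat level fields: everything except the list of events
        context = flatten({k: v for k, v in at_bat.items() if k != "playEvents"})
        for event in at_bat.get("playEvents", []):
            row = flatten(event)
            # In this sketch, at-bat fields overwrite event fields of the same name
            row.update(context)
            rows.append(row)
    return rows


# Toy at-bat with two pitches (hypothetical values)
toy_plays = [
    {
        "result": {"event": "Strikeout"},
        "about": {"inning": 1, "halfInning": "top"},
        "playEvents": [
            {"pitchNumber": 1, "details": {"description": "Ball"}},
            {"pitchNumber": 2, "details": {"description": "Swinging Strike"}},
        ],
    },
]
rows = pitch_rows(toy_plays)
print(sorted(rows[0]))
```

The dot-separated keys mirror column names such as details.description and about.halfInning in the table documented below.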

In [16]:
#| code-fold: true

#' @rdname mlb_pbp
#' @title **Acquire pitch-by-pitch data for Major and Minor League games**
#'
#' @param game_pk The unique game_pk identifier of the game to pull play-by-play data for
#' @importFrom jsonlite fromJSON
#' @return Returns a tibble that includes over 100 columns of data provided
#' by the MLB Stats API at a pitch level.
#'
#' Some data will vary depending on the
#' park and the league level, as most sensor data is not available in
#' minor league parks via this API. Note that the column names have mostly
#' been left as-is and there are likely duplicate columns in terms of the
#' information they provide. I plan to clean the output up down the road, but
#' for now I am leaving the majority as-is.
#'
#' Both major and minor league pitch-by-pitch data can be pulled with this function.
#' 
#'  |col_name                       |types     |
#'  |:------------------------------|:---------|
#'  |game_pk                        |numeric   |
#'  |game_date                      |character |
#'  |index                          |integer   |
#'  |startTime                      |character |
#'  |endTime                        |character |
#'  |isPitch                        |logical   |
#'  |type                           |character |
#'  |playId                         |character |
#'  |pitchNumber                    |integer   |
#'  |details.description            |character |
#'  |details.event                  |character |
#'  |details.awayScore              |integer   |
#'  |details.homeScore              |integer   |
#'  |details.isScoringPlay          |logical   |
#'  |details.hasReview              |logical   |
#'  |details.code                   |character |
#'  |details.ballColor              |character |
#'  |details.isInPlay               |logical   |
#'  |details.isStrike               |logical   |
#'  |details.isBall                 |logical   |
#'  |details.call.code              |character |
#'  |details.call.description       |character |
#'  |count.balls.start              |integer   |
#'  |count.strikes.start            |integer   |
#'  |count.outs.start               |integer   |
#'  |player.id                      |integer   |
#'  |player.link                    |character |
#'  |pitchData.strikeZoneTop        |numeric   |
#'  |pitchData.strikeZoneBottom     |numeric   |
#'  |details.fromCatcher            |logical   |
#'  |pitchData.coordinates.x        |numeric   |
#'  |pitchData.coordinates.y        |numeric   |
#'  |hitData.trajectory             |character |
#'  |hitData.hardness               |character |
#'  |hitData.location               |character |
#'  |hitData.coordinates.coordX     |numeric   |
#'  |hitData.coordinates.coordY     |numeric   |
#'  |actionPlayId                   |character |
#'  |details.eventType              |character |
#'  |details.runnerGoing            |logical   |
#'  |position.code                  |character |
#'  |position.name                  |character |
#'  |position.type                  |character |
#'  |position.abbreviation          |character |
#'  |battingOrder                   |character |
#'  |atBatIndex                     |character |
#'  |result.type                    |character |
#'  |result.event                   |character |
#'  |result.eventType               |character |
#'  |result.description             |character |
#'  |result.rbi                     |integer   |
#'  |result.awayScore               |integer   |
#'  |result.homeScore               |integer   |
#'  |about.atBatIndex               |integer   |
#'  |about.halfInning               |character |
#'  |about.inning                   |integer   |
#'  |about.startTime                |character |
#'  |about.endTime                  |character |
#'  |about.isComplete               |logical   |
#'  |about.isScoringPlay            |logical   |
#'  |about.hasReview                |logical   |
#'  |about.hasOut                   |logical   |
#'  |about.captivatingIndex         |integer   |
#'  |count.balls.end                |integer   |
#'  |count.strikes.end              |integer   |
#'  |count.outs.end                 |integer   |
#'  |matchup.batter.id              |integer   |
#'  |matchup.batter.fullName        |character |
#'  |matchup.batter.link            |character |
#'  |matchup.batSide.code           |character |
#'  |matchup.batSide.description    |character |
#'  |matchup.pitcher.id             |integer   |
#'  |matchup.pitcher.fullName       |character |
#'  |matchup.pitcher.link           |character |
#'  |matchup.pitchHand.code         |character |
#'  |matchup.pitchHand.description  |character |
#'  |matchup.splits.batter          |character |
#'  |matchup.splits.pitcher         |character |
#'  |matchup.splits.menOnBase       |character |
#'  |batted.ball.result             |factor    |
#'  |home_team                      |character |
#'  |home_level_id                  |integer   |
#'  |home_level_name                |character |
#'  |home_parentOrg_id              |integer   |
#'  |home_parentOrg_name            |character |
#'  |home_league_id                 |integer   |
#'  |home_league_name               |character |
#'  |away_team                      |character |
#'  |away_level_id                  |integer   |
#'  |away_level_name                |character |
#'  |away_parentOrg_id              |integer   |
#'  |away_parentOrg_name            |character |
#'  |away_league_id                 |integer   |
#'  |away_league_name               |character |
#'  |batting_team                   |character |
#'  |fielding_team                  |character |
#'  |last.pitch.of.ab               |character |
#'  |pfxId                          |character |
#'  |details.trailColor             |character |
#'  |details.type.code              |character |
#'  |details.type.description       |character |
#'  |pitchData.startSpeed           |numeric   |
#'  |pitchData.endSpeed             |numeric   |
#'  |pitchData.zone                 |integer   |
#'  |pitchData.typeConfidence       |numeric   |
#'  |pitchData.plateTime            |numeric   |
#'  |pitchData.extension            |numeric   |
#'  |pitchData.coordinates.aY       |numeric   |
#'  |pitchData.coordinates.aZ       |numeric   |
#'  |pitchData.coordinates.pfxX     |numeric   |
#'  |pitchData.coordinates.pfxZ     |numeric   |
#'  |pitchData.coordinates.pX       |numeric   |
#'  |pitchData.coordinates.pZ       |numeric   |
#'  |pitchData.coordinates.vX0      |numeric   |
#'  |pitchData.coordinates.vY0      |numeric   |
#'  |pitchData.coordinates.vZ0      |numeric   |
#'  |pitchData.coordinates.x0       |numeric   |
#'  |pitchData.coordinates.y0       |numeric   |
#'  |pitchData.coordinates.z0       |numeric   |
#'  |pitchData.coordinates.aX       |numeric   |
#'  |pitchData.breaks.breakAngle    |numeric   |
#'  |pitchData.breaks.breakLength   |numeric   |
#'  |pitchData.breaks.breakY        |numeric   |
#'  |pitchData.breaks.spinRate      |integer   |
#'  |pitchData.breaks.spinDirection |integer   |
#'  |hitData.launchSpeed            |numeric   |
#'  |hitData.launchAngle            |numeric   |
#'  |hitData.totalDistance          |numeric   |
#'  |injuryType                     |character |
#'  |umpire.id                      |integer   |
#'  |umpire.link                    |character |
#'  |isBaseRunningPlay              |logical   |
#'  |isSubstitution                 |logical   |
#'  |about.isTopInning              |logical   |
#'  |matchup.postOnFirst.id         |integer   |
#'  |matchup.postOnFirst.fullName   |character |
#'  |matchup.postOnFirst.link       |character |
#'  |matchup.postOnSecond.id        |integer   |
#'  |matchup.postOnSecond.fullName  |character |
#'  |matchup.postOnSecond.link      |character |
#'  |matchup.postOnThird.id         |integer   |
#'  |matchup.postOnThird.fullName   |character |
#'  |matchup.postOnThird.link       |character |
#' @export
#' @examples \donttest{
#'   try(mlb_pbp(game_pk = 632970))
#' }

mlb_pbp <- function(game_pk) {
  
  mlb_endpoint <- mlb_stats_endpoint(glue::glue("v1.1/game/{game_pk}/feed/live"))
  
  tryCatch(
    expr = {
      payload <- mlb_endpoint %>% 
        mlb_api_call() %>% 
        jsonlite::toJSON() %>% 
        jsonlite::fromJSON(flatten = TRUE)
      
      plays <- payload$liveData$plays$allPlays$playEvents %>% 
        dplyr::bind_rows()
      
      at_bats <- payload$liveData$plays$allPlays
      
      current <- payload$liveData$plays$currentPlay
      
      game_status <- payload$gameData$status$abstractGameState
      
      home_team <- payload$gameData$teams$home$name
      
      home_level <- payload$gameData$teams$home$sport
      
      home_league <- payload$gameData$teams$home$league
      
      away_team <- payload$gameData$teams$away$name
      
      away_level <- payload$gameData$teams$away$sport
      
      away_league <- payload$gameData$teams$away$league
      
      columns <- lapply(at_bats, function(x) class(x)) %>%
        dplyr::bind_rows(.id = "variable")
      cols <- c(colnames(columns))
      classes <- c(t(unname(columns[1,])))
      
      df <- data.frame(cols, classes)
      list_columns <- df %>%
        dplyr::filter(.data$classes == "list") %>%
        dplyr::pull("cols")
      
      at_bats <- at_bats %>%
        dplyr::select(-c(tidyr::one_of(list_columns)))
      
      pbp <- plays %>%
        dplyr::left_join(at_bats, by = c("endTime" = "playEndTime"))
      
      pbp <- pbp %>%
        tidyr::fill("atBatIndex":"matchup.splits.menOnBase", .direction = "up") %>%
        dplyr::mutate(
          game_pk = game_pk,
          game_date = substr(payload$gameData$datetime$dateTime, 1, 10)) %>%
        dplyr::select("game_pk", "game_date", tidyr::everything())
      
      pbp <- pbp %>%
        dplyr::mutate(
          matchup.batter.fullName = factor(.data$matchup.batter.fullName),
          matchup.pitcher.fullName = factor(.data$matchup.pitcher.fullName),
          atBatIndex = factor(.data$atBatIndex)
          # batted.ball.result = case_when(!result.event %in% c(
          #   "Single", "Double", "Triple", "Home Run") ~ "Out/Other",
          #   TRUE ~ result.event),
          # batted.ball.result = factor(batted.ball.result,
          #                             levels = c("Single", "Double", "Triple", "Home Run", "Out/Other"))
        ) %>%
        dplyr::mutate(
          home_team = home_team,
          home_level_id = home_level$id,
          home_level_name = home_level$name,
          home_parentOrg_id = payload$gameData$teams$home$parentOrgId,
          home_parentOrg_name = payload$gameData$teams$home$parentOrgName,
          home_league_id = home_league$id,
          home_league_name = home_league$name,
          away_team = away_team,
          away_level_id = away_level$id,
          away_level_name = away_level$name,
          away_parentOrg_id = payload$gameData$teams$away$parentOrgId,
          away_parentOrg_name = payload$gameData$teams$away$parentOrgName,
          away_league_id = away_league$id,
          away_league_name = away_league$name,
          batting_team = factor(ifelse(.data$about.halfInning == "bottom",
                                       .data$home_team,
                                       .data$away_team)),
          fielding_team = factor(ifelse(.data$about.halfInning == "bottom",
                                        .data$away_team,
                                        .data$home_team)))
      pbp <- pbp %>%
        dplyr::arrange(desc(.data$atBatIndex), desc(.data$pitchNumber))
      
      pbp <- pbp %>%
        dplyr::group_by(.data$atBatIndex) %>%
        dplyr::mutate(
          last.pitch.of.ab =  ifelse(.data$pitchNumber == max(.data$pitchNumber), "true", "false"),
          last.pitch.of.ab = factor(.data$last.pitch.of.ab)) %>%
        dplyr::ungroup()
      
      pbp <- dplyr::bind_rows(baseballr::stats_api_live_empty_df, pbp)
      
      check_home_level <- pbp %>%
        dplyr::distinct(.data$home_level_id) %>%
        dplyr::pull()
      
      # this will need to be updated in the future to properly estimate X,Z coordinates at the minor league level
      
      # if(check_home_level != 1) {
      #
      #   pbp <- pbp %>%
      #     dplyr::mutate(pitchData.coordinates.x = -pitchData.coordinates.x,
      #                   pitchData.coordinates.y = -pitchData.coordinates.y)
      #
      #   pbp <- pbp %>%
      #     dplyr::mutate(pitchData.coordinates.pX_est = predict(x_model, pbp),
      #                   pitchData.coordinates.pZ_est = predict(y_model, pbp))
      #
      #   pbp <- pbp %>%
      #     dplyr::mutate(pitchData.coordinates.x = -pitchData.coordinates.x,
      #                   pitchData.coordinates.y = -pitchData.coordinates.y)
      # }
      
      pbp <- pbp %>%
        dplyr::rename(
          "count.balls.start" = "count.balls.x",
          "count.strikes.start" = "count.strikes.x",
          "count.outs.start" = "count.outs.x",
          "count.balls.end" = "count.balls.y",
          "count.strikes.end" = "count.strikes.y",
          "count.outs.end" = "count.outs.y") %>%
        make_baseballr_data("MLB Play-by-Play data from MLB.com",Sys.time())
    },
    error = function(e) {
      message(glue::glue("{Sys.time()}: Invalid arguments provided"))
    },
    finally = {
    }
  ) 
  return(pbp)
}

#' @rdname get_pbp_mlb
#' @title **(legacy) Acquire pitch-by-pitch data for Major and Minor League games**
#' @inheritParams mlb_pbp
#' @return Returns a tibble that includes over 100 columns of data provided
#' by the MLB Stats API at a pitch level.
#' @keywords legacy
#' @export
# get_pbp_mlb <- mlb_pbp

Here is an example using the mlb_pbp function.

In [17]:
example <- (mlb_pbp(575156))
head(example)

2023-10-12 13:40:46.684707: Invalid arguments provided



game_pk,game_date,index,startTime,endTime,isPitch,type,playId,pitchNumber,details.description,...,about.isTopInning,matchup.postOnFirst.id,matchup.postOnFirst.fullName,matchup.postOnFirst.link,matchup.postOnSecond.id,matchup.postOnSecond.fullName,matchup.postOnSecond.link,matchup.postOnThird.id,matchup.postOnThird.fullName,matchup.postOnThird.link
<dbl>,<chr>,<int>,<chr>,<chr>,<lgl>,<chr>,<chr>,<int>,<chr>,...,<lgl>,<int>,<chr>,<chr>,<int>,<chr>,<chr>,<int>,<chr>,<chr>
575156,2019-06-01,5,2019-06-01T15:38:42.000Z,2019-06-01T19:38:07.354Z,True,pitch,05751566-0846-0063-000c-f08cd117d70a,6,"In play, out(s)",...,True,,,,,,,,,
575156,2019-06-01,4,2019-06-01T15:38:19.000Z,2019-06-01T15:38:42.000Z,True,pitch,05751566-0846-0053-000c-f08cd117d70a,5,Foul,...,True,,,,,,,,,
575156,2019-06-01,3,2019-06-01T15:38:02.000Z,2019-06-01T15:38:19.000Z,True,pitch,05751566-0846-0043-000c-f08cd117d70a,4,Swinging Strike,...,True,,,,,,,,,
575156,2019-06-01,2,2019-06-01T15:37:45.000Z,2019-06-01T15:38:02.000Z,True,pitch,05751566-0846-0033-000c-f08cd117d70a,3,Swinging Strike,...,True,,,,,,,,,
575156,2019-06-01,1,2019-06-01T15:37:31.000Z,2019-06-01T15:37:45.000Z,True,pitch,05751566-0846-0023-000c-f08cd117d70a,2,Ball,...,True,,,,,,,,,
575156,2019-06-01,0,2019-06-01T15:37:15.000Z,2019-06-01T15:37:31.000Z,True,pitch,05751566-0846-0013-000c-f08cd117d70a,1,Ball,...,True,,,,,,,,,


### Acquiring Data

I will pull more data eventually, but for now I am taking two series of games from the 2023 season. 

In [38]:
library(baseballr)

The code below finds the game_pk values for a given date, which I then use to pull play-by-play data.
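
For readers working in Python, the game_pk lookup described above might look roughly like the sketch below. It assumes the v1/schedule endpoint accepts `sportId` and `date` query parameters and returns games nested under `dates` and `games`; those details, along with the helper names and the toy payload, are assumptions rather than something taken from the baseballr source. No network call is made here.

```python
from urllib.parse import urlencode


def schedule_url(date, sport_id=1):
    """Build a schedule URL for one date (query parameter names are assumed)."""
    query = urlencode({"sportId": sport_id, "date": date})
    return f"http://statsapi.mlb.com/api/v1/schedule?{query}"


def extract_game_pks(schedule):
    """Collect gamePk values from a parsed schedule response."""
    return [
        game["gamePk"]
        for day in schedule.get("dates", [])
        for game in day.get("games", [])
    ]


# Toy payload in the assumed response shape
toy_schedule = {
    "dates": [
        {"date": "2023-06-25", "games": [{"gamePk": 717623}, {"gamePk": 717621}]}
    ]
}
print(schedule_url("2023-06-25"))
print(extract_game_pks(toy_schedule))
```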

In [33]:
mlb_game_pks("2023-06-25")
# mlb_game_pks("2023-06-24")
# mlb_game_pks("2023-06-23")

game_pk,gameGuid,link,gameType,season,gameDate,officialDate,isTie,gameNumber,publicFacing,...,teams.home.leagueRecord.wins,teams.home.leagueRecord.losses,teams.home.leagueRecord.pct,teams.home.team.id,teams.home.team.name,teams.home.team.link,venue.id,venue.name,venue.link,content.link
<int>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<lgl>,<int>,<lgl>,...,<int>,<int>,<chr>,<int>,<chr>,<chr>,<int>,<chr>,<chr>,<chr>
717623,e33c1a00-85cd-46cd-9444-a8f79a5e3846,/api/v1.1/game/717623/feed/live,R,2023,2023-06-25T14:10:00Z,2023-06-25,False,1,True,...,32,45,0.416,138,St. Louis Cardinals,/api/v1/teams/138,5381,London Stadium,/api/v1/venues/5381,/api/v1/game/717623/content
717621,ac54164c-0dcc-4575-8425-9ec93a0bf1ad,/api/v1.1/game/717621/feed/live,R,2023,2023-06-25T16:10:00Z,2023-06-25,False,1,True,...,33,43,0.434,116,Detroit Tigers,/api/v1/teams/116,2394,Comerica Park,/api/v1/venues/2394,/api/v1/game/717621/content
717627,919ec743-ee10-4515-a193-cac2085ad0dc,/api/v1.1/game/717627/feed/live,R,2023,2023-06-25T17:35:00Z,2023-06-25,False,1,True,...,47,29,0.618,110,Baltimore Orioles,/api/v1/teams/110,2,Oriole Park at Camden Yards,/api/v1/venues/2,/api/v1/game/717627/content
717624,b1cfe254-fa22-475c-8edc-e8e16b35e912,/api/v1.1/game/717624/feed/live,R,2023,2023-06-25T17:35:00Z,2023-06-25,False,1,True,...,40,37,0.519,143,Philadelphia Phillies,/api/v1/teams/143,2681,Citizens Bank Park,/api/v1/venues/2681,/api/v1/game/717624/content
717622,be439c03-189a-461c-99b5-456993e5a967,/api/v1.1/game/717622/feed/live,R,2023,2023-06-25T17:35:00Z,2023-06-25,False,1,True,...,43,35,0.551,147,New York Yankees,/api/v1/teams/147,3313,Yankee Stadium,/api/v1/venues/3313,/api/v1/game/717622/content
717617,0d6be3aa-5772-48ad-a201-61a50debb989,/api/v1.1/game/717617/feed/live,R,2023,2023-06-25T17:37:00Z,2023-06-25,False,1,True,...,43,36,0.544,141,Toronto Blue Jays,/api/v1/teams/141,14,Rogers Centre,/api/v1/venues/14,/api/v1/game/717617/content
717619,513d829b-39b8-4167-87b1-49d8805147fc,/api/v1.1/game/717619/feed/live,R,2023,2023-06-25T17:40:00Z,2023-06-25,False,1,True,...,54,27,0.667,139,Tampa Bay Rays,/api/v1/teams/139,12,Tropicana Field,/api/v1/venues/12,/api/v1/game/717619/content
717618,eb75479b-3b52-4944-9d9f-6503ebda4935,/api/v1.1/game/717618/feed/live,R,2023,2023-06-25T17:40:00Z,2023-06-25,False,1,True,...,41,37,0.526,113,Cincinnati Reds,/api/v1/teams/113,2602,Great American Ball Park,/api/v1/venues/2602,/api/v1/game/717618/content
717620,cffb040e-7b40-488f-b0e2-5f7bc735ef3e,/api/v1.1/game/717620/feed/live,R,2023,2023-06-25T17:40:00Z,2023-06-25,False,1,True,...,45,34,0.57,146,Miami Marlins,/api/v1/teams/146,4169,loanDepot park,/api/v1/venues/4169,/api/v1/game/717620/content
717613,f2df18cb-1bf3-4490-8b3e-a3edeaeff769,/api/v1.1/game/717613/feed/live,R,2023,2023-06-25T17:40:00Z,2023-06-25,False,1,True,...,37,40,0.481,114,Cleveland Guardians,/api/v1/teams/114,5,Progressive Field,/api/v1/venues/5,/api/v1/game/717613/content


In [None]:
#game_pk values

#diamondbacks/giants - 717641, 717639, 717612

#mariners/orioles - 717651, 717628, 717627

In [37]:
x <- c(717641, 717639, 717612, 717651, 717628, 717627)
result <- lapply(x, mlb_pbp)
combined_tibble <- bind_rows(result)
# Save the data to a CSV file
write.csv(combined_tibble, file = "./data/raw_data/baseballr_six_games.csv", row.names = FALSE)
head(combined_tibble)

game_pk,game_date,index,startTime,endTime,isPitch,type,playId,pitchNumber,details.description,...,matchup.postOnThird.link,reviewDetails.isOverturned,reviewDetails.inProgress,reviewDetails.reviewType,reviewDetails.challengeTeamId,base,details.violation.type,details.violation.description,details.violation.player.id,details.violation.player.fullName
<dbl>,<chr>,<int>,<chr>,<chr>,<lgl>,<chr>,<chr>,<int>,<chr>,...,<chr>,<lgl>,<lgl>,<chr>,<int>,<int>,<chr>,<chr>,<int>,<chr>
717641,2023-06-24,2,2023-06-24T04:40:41.468Z,2023-06-24T04:40:49.543Z,True,pitch,a8483d6b-3cff-4190-827c-1b4c71f60ef8,3,"In play, out(s)",...,,,,,,,,,,
717641,2023-06-24,1,2023-06-24T04:40:24.685Z,2023-06-24T04:40:28.580Z,True,pitch,49eba946-3aaa-4260-895b-3de29cb49043,2,Foul,...,,,,,,,,,,
717641,2023-06-24,0,2023-06-24T04:40:08.036Z,2023-06-24T04:40:12.278Z,True,pitch,f879f5a0-8570-4594-ae73-3f09d1a53ee1,1,Ball,...,,,,,,,,,,
717641,2023-06-24,6,2023-06-24T04:39:08.422Z,2023-06-24T04:39:16.691Z,True,pitch,3077f596-0221-4469-9841-f1684c629288,6,"In play, out(s)",...,,,,,,,,,,
717641,2023-06-24,5,2023-06-24T04:38:49.567Z,2023-06-24T04:38:53.482Z,True,pitch,21a33e9d-e596-408b-9168-141acc0b1b63,5,Foul,...,,,,,,,,,,
717641,2023-06-24,4,2023-06-24T04:38:32.110Z,2023-06-24T04:38:36.156Z,True,pitch,db083639-52be-41f4-b6d9-f72601ef1508,4,Foul,...,,,,,,,,,,


# ncaahoopR

"ncaahoopR" is an R package tailored for analyzing NCAA basketball play-by-play data. It excels at retrieving play-by-play data in a tidy format. For the purposes of this project, I will start by scraping play-by-play data for the Villanova Wildcats men's basketball team from both the 2019-20 and 2021-22 seasons (the 2020-21 season was shortened due to COVID-19).

In [27]:
install.packages("devtools")
devtools::install_github("lbenz730/ncaahoopR")
library(ncaahoopR)






In [None]:
Villanova1920 <- get_pbp("Villanova", "2019-20")
Villanova2122 <- get_pbp("Villanova", "2021-22")
write.csv(Villanova1920, file = "./data/raw_data/villanova1920.csv", row.names = FALSE)
write.csv(Villanova2122, file = "./data/raw_data/villanova2122.csv", row.names = FALSE)

In [21]:
head(Villanova1920)

row,game_id,date,home,away,play_id,half,time_remaining_half,secs_remaining,secs_remaining_absolute,description,...,shot_y,shot_team,shot_outcome,shooter,assist,three_pt,free_throw,possession_before,possession_after,wrong_time
<int>,<chr>,<date>,<chr>,<chr>,<int>,<int>,<chr>,<dbl>,<dbl>,<chr>,...,<dbl>,<chr>,<chr>,<chr>,<chr>,<lgl>,<lgl>,<chr>,<chr>,<lgl>
1,401169778,2019-11-05,Villanova,Army,1,1,19:37,2377,2377,Saddiq Bey made Jumper.,...,,Villanova,made,Saddiq Bey,,False,False,Villanova,Army,False
2,401169778,2019-11-05,Villanova,Army,2,1,19:16,2356,2356,Tucker Blackwell made Jumper. Assisted by Tommy Funk.,...,,Army,made,Tucker Blackwell,Tommy Funk,False,False,Army,Villanova,False
3,401169778,2019-11-05,Villanova,Army,3,1,19:01,2341,2341,Foul on Jermaine Samuels.,...,,,,,,,,Villanova,Army,False
4,401169778,2019-11-05,Villanova,Army,4,1,19:01,2341,2341,Jermaine Samuels Turnover.,...,,,,,,,,Villanova,Army,False
5,401169778,2019-11-05,Villanova,Army,5,1,18:42,2322,2322,Matt Wilson made Jumper. Assisted by Tommy Funk.,...,,Army,made,Matt Wilson,Tommy Funk,False,False,Army,Villanova,False
6,401169778,2019-11-05,Villanova,Army,6,1,18:31,2311,2311,Jeremiah Robinson-Earl made Jumper. Assisted by Justin Moore.,...,,Villanova,made,Jeremiah Robinson-Earl,Justin Moore,False,False,Villanova,Army,False


# Reddit

In [None]:
#install.packages("RedditExtractoR") #only executable in RStudio
library(RedditExtractoR)


subreddit <- "baseball"

# Get posts from the r/baseball subreddit
streaks <- find_thread_urls(keywords = "streak" ,subreddit=subreddit, sort_by="top", period = 'year')

hot <- find_thread_urls(keywords = "hot" ,subreddit=subreddit, sort_by="top", period = 'year')

write.csv(streaks, file = "./streaks.csv", row.names = FALSE)
write.csv(hot, file = "./hot.csv", row.names = FALSE)

Below you can see the first few rows of both the "hot" and "streaks" csv files.

![Figure 4](./images/hot.png)
![Figure 5](./images/streaks.png)

# News API

In [8]:
#| code-fold: true

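import os
import requests
import json
import re
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Read the News API key from an environment variable instead of hard-coding it.
# NEWSAPI_KEY is an assumed variable name; set it before running this cell.
API_KEY = os.environ.get('NEWSAPI_KEY', '')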

In [9]:
#| code-fold: true

baseURL = "https://newsapi.org/v2/everything?"
total_requests=2
verbose=True

def string_cleaner(input_string):
    try: 
        #REPLACE RUNS OF PUNCTUATION (AND ANY TRAILING SPACES) WITH A SINGLE SPACE
        out=re.sub(r"""
                    [,.;@#?!&$-]+  # Accept one or more copies of punctuation
                    \ *           # plus zero or more copies of a space,
                    """,
                    " ",          # and replace it with a single space
                    input_string, flags=re.VERBOSE)

        #REPLACE SELECT CHARACTERS WITH NOTHING
        #(APPLIED TO out, NOT input_string, SO THE PREVIOUS STEP IS NOT DISCARDED)
        out = re.sub('[’.]+', '', out)

        #ELIMINATE DUPLICATE WHITESPACES USING WILDCARDS
        out = re.sub(r'\s+', ' ', out)

        #CONVERT TO LOWER CASE
        out=out.lower()
    except TypeError:
        #NON-STRING INPUT (E.G. A MISSING AUTHOR RETURNED AS None)
        print("ERROR")
        out=''
    return out

In [16]:
%%capture

TOPIC = 'hot streak sports'

URLpost = {'apiKey': API_KEY,
            'q': '+'+TOPIC,
            'sortBy': 'relevancy',
            'totalRequests': 1}



#GET DATA FROM API
response = requests.get(baseURL, URLpost) #request data from the server
# print(response.url);  
response = response.json() #extract txt data from request into json



# #GET TIMESTAMP FOR PULL REQUEST
from datetime import datetime
timestamp = datetime.now().strftime("%Y-%m-%d-H%H-M%M-S%S")

# SAVE TO FILE 
with open(timestamp+'-newapi-raw-data.json', 'w') as outfile:
    json.dump(response, outfile, indent=4)

article_list=response['articles']   #list of dictionaries for each article
article_keys=article_list[0].keys()
#print("AVAILABLE KEYS:")
#print(article_keys)
index=0
cleaned_data=[];  
for article in article_list:
    tmp=[]
    if(verbose):
        print("#------------------------------------------")
        print("#",index)
        print("#------------------------------------------")

    for key in article_keys:
        if(verbose):
            print("----------------")
            print(key)
            print(article[key])
            print("----------------")

        if(key=='source'):
            src=string_cleaner(article[key]['name'])
            tmp.append(src) 

        if(key=='author'):
            author=string_cleaner(article[key])
            #ERROR CHECK (SOMETIMES AUTHOR IS SAME AS PUBLICATION)
            if(src in author): 
                print(" AUTHOR ERROR:",author);author='NA'
            tmp.append(author)

        if(key=='title'):
            tmp.append(string_cleaner(article[key]))

        # if(key=='description'):
        #     tmp.append(string_cleaner(article[key]))

        # if(key=='content'):
        #     tmp.append(string_cleaner(article[key]))

        if(key=='publishedAt'):
            #DEFINE DATE PATTERN FOR RE TO CHECK  .* --> wildcard
            ref = re.compile('.*-.*-.*T.*:.*:.*Z')
            date=article[key]
            if(not ref.match(date)):
                print(" DATE ERROR:",date); date="NA"
            tmp.append(date)

    cleaned_data.append(tmp)
    index+=1

df1 = pd.DataFrame(cleaned_data)
df1.to_csv('./data/raw_data/newsapi.csv', index=False) #,index_label=['title','src','author','date','description'])

Below you can see the first few rows of the newsapi.csv file:
<br></br>
![Figure 6](./images/newsapi.png)
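
The date check in the cleaning loop uses a loose wildcard pattern, so a string like "a-b-cTd:e:fZ" would pass. A stricter alternative is to parse the timestamp, as in the sketch below. It assumes that `publishedAt` values look like 2023-10-12T13:40:46Z with no fractional seconds; the function name `valid_published_at` is hypothetical.

```python
from datetime import datetime


def valid_published_at(value):
    """Return True if value parses as a timestamp like 2023-10-12T13:40:46Z."""
    if not isinstance(value, str):
        return False
    try:
        # Assumed format: no fractional seconds, literal trailing Z
        datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    except ValueError:
        return False
    return True


for candidate in ["2023-10-12T13:40:46Z", "a-b-cTd:e:fZ", None]:
    print(candidate, valid_published_at(candidate))
```

Parsing also rejects impossible dates such as month 13, which a wildcard pattern cannot detect.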

# Individual Player Data

I also want to scrape specific data from FanGraphs but was having some trouble. For now, I downloaded a few tables of game data for Aaron Judge and merged them together. Below are screenshots of the initial CSV file.
<br></br>
![Figure 7](./images/judge1.png)
![Figure 8](./images/judge2.png)

# Extra Joke

How much data can be stored in a glacier? A frostbite.
<br></br>
![Figure 9](./images/SnowMiser.png)