# Exploratory Data Analysis: Games

## Analysis Questions
### TO DO
* Data Structure / Granularity
* Scale & Basic Stats
* Distribution
* Correlation

### DONE

# Code & Data Setup

In [1]:
library(tidyverse)
library(bigrquery)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──

✔ ggplot2 3.3.2     ✔ purrr   0.3.4
✔ tibble  3.0.4     ✔ dplyr   1.0.2
✔ tidyr   1.1.2     ✔ stringr 1.4.0
✔ readr   1.4.0     ✔ forcats 0.5.0

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()



In [6]:
bq_auth()

Next, download the data from the view "GAMES_V". The view definition is in https://github.com/bbrewington/hockey-analytics/blob/main/data-moneypuck/bigquery-views/sql/games.sql

(The "DATA.GAMES" table combines all years of games data from http://moneypuck.com/data.htm. Details are in the data pipeline queries at https://github.com/bbrewington/hockey-analytics/tree/main/data-moneypuck)

In [7]:
games <- 
  bq_project_query('moneypuckdata-sandbox',
                   'SELECT * FROM `moneypuckdata-sandbox.DATA.GAMES_V`') %>%
  bq_table_download()

# Data Structure / Granularity

dplyr::glimpse shows the data structure: the number of rows and columns, every variable with its R data type, and the first few values of each column (as many as fit on one line)

Data Granularity: season / gameId / team / situation
* season: YYYY (the year in which the season starts; in a normal year, that is the fall)
* gameId: YYYYNNNNNN
  - The first 4 digits identify the season of the game (e.g. 2017 for the 2017-2018 season).
  - The next 2 digits give the type of game: 01 = preseason, 02 = regular season, 03 = playoffs, 04 = all-star.
  - The final 4 digits identify the specific game number. For regular season and preseason games, this ranges from 0001 to the number of games played (1271 for seasons with 31 teams, 2017 onwards, and 1230 for seasons with 30 teams). For playoff games, the 2nd digit of the game number gives the round of the playoffs, the 3rd digit specifies the matchup, and the 4th digit specifies the game (out of 7).
  - info from: https://gitlab.com/dword4/nhlapi/-/blob/master/stats-api.md#game-ids
* situation
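
The gameId layout and the stated granularity can be checked directly in R. The following is a minimal sketch, not part of the original notebook: `decode_game_id()` is a hypothetical helper, the `toy` tibble stands in for `games`, and the situation value "all" is an assumed placeholder. It splits a gameId into season, game type, and game number with integer arithmetic, then counts rows per season / gameId / team / situation to look for duplicates.

```r
library(dplyr)

# Hypothetical helper (not in the original notebook): split a 10-digit gameId
# into the fields described in the list above.
decode_game_id <- function(game_id) {
  game_id <- as.integer(game_id)
  # Lookup for the 2-digit game type code (digits 5-6)
  type_labels <- c("01" = "preseason", "02" = "regular season",
                   "03" = "playoffs",  "04" = "all-star")
  type_code   <- (game_id %/% 10000L) %% 100L
  game_number <- game_id %% 10000L             # digits 7-10
  is_playoff  <- type_code == 3L

  tibble(
    gameId          = game_id,
    id_season       = game_id %/% 1000000L,    # digits 1-4
    game_type       = unname(type_labels[sprintf("%02d", type_code)]),
    game_number     = game_number,
    # Playoff-only fields; NA for every other game type
    playoff_round   = if_else(is_playoff, (game_number %/% 100L) %% 10L, NA_integer_),
    playoff_matchup = if_else(is_playoff, (game_number %/% 10L) %% 10L, NA_integer_),
    playoff_game    = if_else(is_playoff, game_number %% 10L, NA_integer_)
  )
}

# Toy rows standing in for `games`; the situation value is an assumption
toy <- tibble(
  season    = c(2008L, 2008L, 2017L),
  gameId    = c(2008020008L, 2008020008L, 2017030325L),
  team      = c("ANA", "S.J", "ANA"),
  situation = c("all", "all", "all")
)

decoded <- decode_game_id(toy$gameId)

# The season column should agree with the first 4 digits of gameId
season_matches <- all(toy$season == decoded$id_season)

# Granularity check: any key combination appearing more than once is a duplicate
dupes <- toy %>%
  count(season, gameId, team, situation) %>%
  filter(n > 1)
```

If the granularity claim holds for the real data, running the same `count()` / `filter()` pipeline on `games` instead of `toy` should return zero rows.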

In [8]:
glimpse(games)

Rows: 157,510
Columns: 117
$ team                                      <chr> "ANA", "ANA", "ANA", "ANA",…
$ season                                    <int> 2008, 2008, 2008, 2008, 200…
$ name                                      <chr> "ANA", "ANA", "ANA", "ANA",…
$ gameId                                    <int> 2008020008, 2008020008, 200…
$ playerTeam                                <chr> "ANA", "ANA", "ANA", "ANA",…
$ opposingTeam                              <chr> "S.J", "S.J", "S.J", "S.J",…
$ home_or_away                              <chr> "AWAY", "AWAY", "AWAY", "AW…
$ gameDate                                  <int> 20081009, 20081009, 2008100…
$ position                                  <chr> "Team Level", "Team Level",…
$ situation                          