# Collecting All of the Data
**By: Victor Ardulov**

In `alphantasy-hockey/fantasy/scraping` there is a module that automates much of what we saw in [Exploring NHL&reg; Scraper](./Exploring_NHL_Scraper.ipynb). The important piece is the function `scrape_game`, whose "help" output is shown below. Here I am writing some code to collect all of the historical data for roughly the last 10 seasons (2009/10 through 2019/20) and save the game reports as JSON dictionaries "locally". My rough back-of-the-envelope math says that the JSON files should be approximately 100 kB each, which means:

$$ 100\text{kB} \times (\frac{82 \times31}{2}) \times 10 \approx 1.27 \text{GB} $$
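
The same estimate can be reproduced in a few lines of Python. This is only a sketch of the arithmetic above: the 100 kB per file is the rough guess from the text, and the constant names are made up for illustration.

```python
# Back-of-the-envelope size estimate for the scraped game reports.
KB_PER_REPORT = 100   # rough guess at the size of one JSON file, in kB
GAMES_PER_TEAM = 82   # regular-season games per team
TEAMS = 31            # team count used in the estimate
SEASONS = 10          # approximate number of seasons to collect

# Every game involves two teams, so divide by 2 to avoid double counting.
games_per_season = GAMES_PER_TEAM * TEAMS // 2
total_kb = KB_PER_REPORT * games_per_season * SEASONS
total_gb = total_kb / 1_000_000  # decimal units: 1 GB = 10^6 kB

print(f"{games_per_season} games/season, about {total_gb:.2f} GB in total")
```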

So I've also added the `.json` extension to the `.gitignore`. The data will be stored in the top-level directory [nhl-data](../../nhl-data), organized into one subfolder per season, each holding one file `game_report_****.json` per game. This folder can then be used and indexed by something like Elasticsearch, which may make it a little easier to construct records. The alternative is to reorganize the game reports into lists of player-specific dictionaries, but I'm not sure, so I'll probably end up consulting my resident Elasticsearch and Kibana experts.
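
Whichever indexing tool ends up being used, the folder layout itself is easy to walk. Below is a minimal sketch, assuming the `<season>/game_report_<id>.json` layout described above; the helper name `iter_game_reports` is hypothetical and not part of the `fantasy` package.

```python
import json
from pathlib import Path


def iter_game_reports(root):
    """Yield (season, game_id, report) for every report file under root.

    Assumes the layout <root>/<season>/game_report_<id>.json, where the
    season folder name and the id are both integers.
    """
    for report_path in sorted(Path(root).glob("*/game_report_*.json")):
        season = int(report_path.parent.name)
        # "game_report_0042" -> 42
        game_id = int(report_path.stem.rsplit("_", 1)[1])
        with open(report_path) as report_file:
            yield season, game_id, json.load(report_file)
```

Each yielded tuple carries the season and game id alongside the report dictionary, which should be enough to build flat records for an indexer later on.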

In [1]:
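import json
import os
import sys

# sys.path entries are used literally, so expand "~" to the home directory first
sys.path.append(os.path.expanduser("~/Documents/alphantasy-hockey/"))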

In [2]:
from fantasy.scraping.scraper import scrape_game
from dataclasses import asdict

In [3]:
scrape_game??

Signature: scrape_game(seasons=2020, game_ids='all', verbose=False)
Docstring: <no docstring>
Source:   
def scrape_game( seasons=current_year, game_ids="all", verbose=False):
    if isinstance(game_ids, int):
        game_ids = [game_ids]
    elif game_ids == "all":
        game_ids = list(range(1

In [4]:
from os import path, makedirs
path_to_output = path.join("..", "nhl-data")

In [9]:
for season in range(2009, 2020):
    print(f"Processing {season} - {season+1}")
    # One subfolder per season; exist_ok makes a separate isdir check unnecessary
    season_dir = path.join(path_to_output, str(season))
    makedirs(season_dir, exist_ok=True)

    print("Getting game reports...")
    game_reports = scrape_game(seasons=season, game_ids="all", verbose=True)
    print("Game reports done; writing to file...")
    # Reports are keyed by season, then by game id; asdict turns each
    # dataclass report into a plain dictionary that json can serialize
    for game_id in game_reports[season]:
        json_file = "game_report_%04d.json" % game_id
        with open(path.join(season_dir, json_file), 'w') as output_file:
            json.dump(asdict(game_reports[season][game_id]), output_file)

Processing 2009 - 2010
Getting game reports...
Processed 1230 games for 2009 season
Game reports done; writing to file...
Processing 2010 - 2011
Getting game reports...
Processed 1230 games for 2010 season
Game reports done; writing to file...
Processing 2011 - 2012
Getting game reports...
Processed 1230 games for 2011 season
Game reports done; writing to file...
Processing 2012 - 2013
Getting game reports...
Processed 720 games for 2012 season
Game reports done; writing to file...
Processing 2013 - 2014
Getting game reports...
Processed 1230 games for 2013 season
Game reports done; writing to file...
Processing 2014 - 2015
Getting game reports...
Processed 1230 games for 2014 season
Game reports done; writing to file...
Processing 2015 - 2016
Getting game reports...
Processed 1230 games for 2015 season
Game reports done; writing to file...
Processing 2016 - 2017
Getting game reports...
Processed 1230 games for 2016 season
Game reports done; writing to file...
Processing 2017 - 2018
Getting game reports...
Processed 1271 games for 2017 season
Game reports done; writing to file...
Processing 2018 - 2019
Getting game reports...
Processed 1271 games for 2018 season
Game reports done; writing to file...
Processing 2019 - 2020
Getting game reports...
Processed 1082 games for 2019 season
Game reports done; writing to file...
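
The per-season counts line up with the league schedule: 82 games per team with 30 teams gives 1230 games, and 31 teams (from 2017/18 on) gives 1271. The two short counts are the lockout-shortened 48-game 2012/13 season and the 2019/20 season, which was cut off before completion. Here is a quick sanity check, with the counts copied by hand from the output above; the helper `full_schedule` is just an illustrative name.

```python
# Counts copied by hand from the "Processed N games" lines above.
logged = {2009: 1230, 2010: 1230, 2011: 1230, 2012: 720, 2013: 1230, 2014: 1230,
          2015: 1230, 2016: 1230, 2017: 1271, 2018: 1271, 2019: 1082}


def full_schedule(teams, games_per_team=82):
    """Number of games in a season; each game is shared by two teams."""
    return teams * games_per_team // 2


for season, games in logged.items():
    teams = 31 if season >= 2017 else 30      # 31 teams from 2017/18 on
    per_team = 48 if season == 2012 else 82   # 2012/13 was a 48-game season
    expected = full_schedule(teams, per_team)
    status = "full" if games == expected else f"short by {expected - games}"
    print(f"{season}: {games} games ({status})")

total_reports = sum(logged.values())
print(f"{total_reports} reports in total")
```

That comes to 12,954 reports, so if the 100 kB per file guess holds, the total lands close to the original estimate of about 1.27 GB.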


In [11]:
with open("example.json", "w") as example_json:
    json.dump(asdict(game_reports[season][1]), example_json)
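
The `asdict` call is needed because `json.dump` cannot serialize dataclass instances directly. The sketch below shows the round trip on a made-up report type; `GameReport` and `PlayerLine` are hypothetical stand-ins, since the real report class in `fantasy.scraping.scraper` is not shown here.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class PlayerLine:   # hypothetical, for illustration only
    name: str
    goals: int = 0


@dataclass
class GameReport:   # hypothetical stand-in for the scraper's report type
    game_id: int
    home: str
    away: str
    players: list = field(default_factory=list)


report = GameReport(1, "HOME", "AWAY", [PlayerLine("Player A", goals=2)])

# asdict recurses into nested dataclasses, including those inside lists,
# so the result contains only plain dicts, lists, and scalars.
as_plain = asdict(report)

# Serialize and parse again to confirm nothing is lost along the way.
restored = json.loads(json.dumps(as_plain))
print(restored == as_plain)
```

Note that loading the JSON back gives plain dictionaries, not report objects, so any code reading these files should index by key rather than by attribute.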