## Scraping Weather Data
Source : https://www.baseball-reference.com/boxes/

Because game_logs holds the game_ids, I was able to access each game's .json file from this website.
From each .json file, I scraped the weather data (temperature, condition, and wind) and saved it as a separate csv file to merge into my final_data file.

In [23]:
import pandas as pd
import requests

In [24]:
game_logs = pd.read_csv("../data/game_logs_2024.csv")

In [26]:
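def fetch_weather(gamePk: int) -> dict:
    """Fetch temperature, condition, and wind for one game from its live feed."""
    url = f"https://statsapi.mlb.com/api/v1.1/game/{gamePk}/feed/live"
    # Keep game_id in the fallback row so failed games can still be matched and retried
    empty = {"game_id": gamePk, "temp": None, "condition": None, "wind": None}
    try:
        # A timeout stops one stalled request from hanging the whole loop
        resp = requests.get(url, timeout=30)
    except requests.RequestException:
        return empty
    if resp.status_code != 200:
        return empty

    data = resp.json()
    weather = data.get("gameData", {}).get("weather", {})
    return {
        "game_id": gamePk,
        "temp": weather.get("temp"),
        "condition": weather.get("condition"),
        "wind": weather.get("wind"),
    }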
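
Separating the parsing from the request makes the extraction easy to check offline. The sketch below uses a hypothetical helper, `extract_weather` (not part of the notebook), on hand-written payloads. It assumes only the `gameData` then `weather` nesting that `fetch_weather` reads; the sample values are made up for illustration.

```python
from typing import Any, Optional


def extract_weather(game_pk: int, feed: dict[str, Any]) -> dict[str, Optional[Any]]:
    """Pull the three weather fields out of an already-decoded feed."""
    # `or {}` also covers keys that are present but set to None
    weather = (feed.get("gameData") or {}).get("weather") or {}
    return {
        "game_id": game_pk,
        "temp": weather.get("temp"),
        "condition": weather.get("condition"),
        "wind": weather.get("wind"),
    }


# Made-up payloads, for illustration only
sample_feed = {
    "gameData": {
        "weather": {"temp": "72", "condition": "Sunny", "wind": "8 mph, Out To CF"}
    }
}
full_row = extract_weather(1, sample_feed)
empty_row = extract_weather(2, {})  # no weather block: every field falls back to None
```

Because the helper never touches the network, it can be exercised on a saved response before starting a long scraping run.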

Note: I had to save the data in chunks, since scraping everything took a very long time (>20 minutes) and the run kept crashing for reasons I did not identify.

In [49]:
'''
### RUN ONLY ONCE, TAKES A LONG TIME ###

weather_list = []
for i, gamePk in enumerate(game_logs["game_id"], start=1):
    w = fetch_weather(gamePk)
    weather_list.append(w)

    if i % 100 == 0:  # save progress every 100 games
        pd.DataFrame(weather_list).to_csv(f"../data/weather/weather_partial_{i}.csv", index=False)
        print(f"Saved {i}/{len(game_logs)} games")
        weather_list = []  # reset chunk

# last, partial chunk; skipped when empty so no blank csv is written
if weather_list:
    pd.DataFrame(weather_list).to_csv(f"../data/weather/weather_partial_{i}.csv", index=False)
'''
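
Since the run kept crashing, one option is to make the loop resumable, so a restart skips chunks that are already on disk. The following is a minimal sketch under stated assumptions, not the code used in this notebook: `scrape_in_chunks` is a hypothetical helper, the fetch function is passed in (in the notebook it would be `fetch_weather`), and it writes with the standard `csv` module so it runs without pandas. The file names mirror the `weather_partial_*.csv` pattern above.

```python
import csv
from pathlib import Path
from typing import Callable, Iterable

FIELDS = ["game_id", "temp", "condition", "wind"]


def scrape_in_chunks(
    game_ids: Iterable[int],
    fetch: Callable[[int], dict],
    out_dir: Path,
    chunk_size: int = 100,
) -> int:
    """Write one CSV per chunk, skipping chunks already on disk.

    Returns the number of fetch calls made during this run.
    """
    ids = list(game_ids)
    out_dir.mkdir(parents=True, exist_ok=True)
    calls = 0
    for start in range(0, len(ids), chunk_size):
        chunk = ids[start:start + chunk_size]
        end = start + len(chunk)
        path = out_dir / f"weather_partial_{end}.csv"
        if path.exists():
            continue  # this chunk was finished in an earlier run
        rows = []
        for game_id in chunk:
            rows.append(fetch(game_id))
            calls += 1
        # Write under a temporary name first, so a crash never leaves a
        # half-written file that matches the weather_partial_*.csv pattern
        tmp = path.with_suffix(".tmp")
        with tmp.open("w", newline="") as fh:
            writer = csv.DictWriter(fh, fieldnames=FIELDS)
            writer.writeheader()
            writer.writerows(rows)
        tmp.replace(path)
    return calls
```

A call such as `scrape_in_chunks(game_logs["game_id"], fetch_weather, Path("../data/weather"))` could then be re-run after a crash and would only fetch the chunks that are still missing.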

#### Putting all the partial chunks together

In [56]:
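import glob

# sorted() makes the concatenation order deterministic; glob order is arbitrary
weather_files = sorted(glob.glob("../data/weather/weather_partial_*.csv"))
weather_dfs = [pd.read_csv(f) for f in weather_files]
weather_joined = pd.concat(weather_dfs, ignore_index=True)
print("Shape : ", weather_joined.shape)

final_game_logs = pd.read_csv("../data/final_game_logs.csv")
# index=False avoids writing the row index as an extra "Unnamed: 0" column.
# This overwrites the input file, so re-running the cell merges the weather columns in again.
final_game_logs.merge(weather_joined, on="game_id", how="left").to_csv("../data/final_game_logs.csv", index=False)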

Shape :  (2998, 4)
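
A left merge keeps every game even when it has no row in the weather table, so it can help to count how many games actually found a match. Below is a minimal sketch on made-up frames: the `game_id`, `temp`, `condition`, and `wind` columns follow the notebook, while the `home_team` column, the values, and the helper name `merge_weather` are hypothetical.

```python
import pandas as pd


def merge_weather(games: pd.DataFrame, weather: pd.DataFrame) -> pd.DataFrame:
    """Left-merge weather onto games and flag games with no weather row."""
    # Keep one weather row per game_id so the merge cannot multiply game rows
    weather = weather.drop_duplicates(subset="game_id")
    merged = games.merge(weather, on="game_id", how="left", indicator=True)
    merged["has_weather"] = merged["_merge"] == "both"
    return merged.drop(columns="_merge")


# Made-up data, for illustration only
games = pd.DataFrame({"game_id": [1, 2, 3], "home_team": ["A", "B", "C"]})
weather = pd.DataFrame(
    {
        "game_id": [1, 1, 3],
        "temp": ["72", "72", "65"],
        "condition": ["Sunny", "Sunny", "Cloudy"],
        "wind": ["8 mph, Out To CF", "8 mph, Out To CF", "3 mph, Calm"],
    }
)
merged = merge_weather(games, weather)
n_missing = int((~merged["has_weather"]).sum())  # games without a weather row
```

Note that `has_weather` only says a weather row existed for the game; a row whose request failed would still match while holding empty values.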


The weather data was successfully added to final_game_logs. Preprocessing the data (turning strings into numerical values) is done in preprocessing.ipynb.
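
As an illustration of what turning a string into a numerical value can look like, here is a hedged sketch for the `wind` column. It is not the code from preprocessing.ipynb: `parse_wind` is a hypothetical helper, and the `"<speed> mph, <direction>"` layout is an assumption about how the wind strings are formatted.

```python
import re
from typing import Optional


def parse_wind(raw: object) -> tuple[Optional[int], Optional[str]]:
    """Split an assumed "<speed> mph, <direction>" string into (speed, direction)."""
    # Missing values read back from a csv arrive as NaN floats, not strings
    if not isinstance(raw, str):
        return None, None
    match = re.match(r"\s*(\d+)\s*mph\s*,?\s*(.*)", raw)
    if match is None:
        return None, None
    direction = match.group(2).strip() or None
    return int(match.group(1)), direction


# Made-up inputs, for illustration only
speed, direction = parse_wind("8 mph, Out To CF")
```

Returning `None` for anything that does not fit the assumed layout keeps unexpected strings visible as missing values instead of raising in the middle of a column-wide apply.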