## Build Weather Model

### Objective

* Build a daily weather model: a small set of parameters that describes the weather conditions for each disc golf game

### Rationale

* Why This? To run predictive analytics on scores, we need to reduce the weather data to just a few input parameters.

* Why Me? Since I will be building the weather-performance model, I am best suited to process these files.

* Why Now? Building the weather data is a necessary prerequisite for evaluating different time-series models.

### Requirements

* Pandas 0.24.2
* Numpy 1.16.4
* Matplotlib 3.1.0

### Input / Output

* The notebook should be in the folder `models/notebooks`, with the data in `models/wx_record/wx_station_by_date`

* Input files have the form `{station-id}_{mmddyy}_p01.csv`, where `station-id` is a personal weather station id (e.g. `KCASANFR1086`) and `mmddyy` is the date (e.g. `033119`)

* The output file is a CSV that can be read into the database: `models/wx_model_data/wx_model.csv`
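
The filename convention above can be unpacked with a few lines of standard-library Python. This is only a sketch: `parse_wx_filename` is a hypothetical helper, not something the notebook defines, and it assumes every file follows the `{station-id}_{mmddyy}_p01.csv` pattern exactly.

```python
import datetime as dt
import os


def parse_wx_filename(path):
    """Split '{station-id}_{mmddyy}_p01.csv' into (station_id, date).

    Hypothetical helper for illustration only.
    """
    # Normalize Windows separators so basename works the same on every platform
    name = os.path.basename(path.replace("\\", "/"))
    station_id, mmddyy, _suffix = name.split("_", 2)
    return station_id, dt.datetime.strptime(mmddyy, "%m%d%y").date()


example = "models/wx_record/wx_station_by_date/KCASANFR1086_033119_p01.csv"
print(parse_wx_filename(example))
```

Taking the basename first matters because the folder names themselves contain underscores.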

### Import / Set-Up

In [96]:
import pandas as pd
import numpy as np
import glob
import datetime as dt
import pytz
import json
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
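# Make one big DataFrame, with a station_id column taken from each filename
filenames = glob.glob('../wx_record/wx_station_by_date/*.csv')
frames = []
for filename in filenames:
    # Filenames look like {station-id}_{mmddyy}_p01.csv.  Normalize the path separator
    # so the station id is extracted the same way on Windows and POSIX systems.
    station_id = filename.replace('\\', '/').split('/')[-1].split('_')[0]
    df_temp = pd.read_csv(filename, parse_dates=['time'])
    df_temp['station_id'] = station_id
    frames.append(df_temp)
# Concatenate once at the end instead of appending inside the loop
wx_df = pd.concat(frames)
wx_df.head()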

Unnamed: 0.1,Unnamed: 0,time,T,w_dir,w_spd,w_gust,rh,precip,station_id
0,0,2019-01-12 00:04:44,51.0,68.0,0.0,0.0,97.0,0.0,KCABERKE85
1,1,2019-01-12 00:09:58,51.0,68.0,0.0,0.0,97.0,0.0,KCABERKE85
2,2,2019-01-12 00:14:54,51.0,270.0,0.0,0.0,97.0,0.0,KCABERKE85
3,3,2019-01-12 00:19:50,50.0,270.0,0.0,0.0,97.0,0.0,KCABERKE85
4,4,2019-01-12 00:24:46,51.0,270.0,0.0,0.0,97.0,0.0,KCABERKE85


In [3]:
# Import the weather station data
station_df = pd.read_csv('../geo/nearby_wunder_pws.csv')
station_df.head()

Unnamed: 0,station_id,latitude,longitude,course_id,comments
0,KCABERKE85,,,2,no wind gust prior to 2018
1,KCAALBAN12,,,2,not examined yet
2,KCABERKE104,,,2,not examined yet
3,KCASANFR1443,37.777023,-122.48484,0,
4,KCASANFR1086,,,0,


In [4]:
wx_df = wx_df.merge(station_df, on="station_id")
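# Convert the compass bearing (degrees clockwise from north) to a math angle
# (radians counter-clockwise from east), then split the wind speed into components
wx_df['w_rad'] = np.deg2rad(90 - wx_df['w_dir'])
wx_df['w_u'] = wx_df['w_spd'] * np.cos(wx_df['w_rad'])
wx_df['w_v'] = wx_df['w_spd'] * np.sin(wx_df['w_rad'])
wx_df = wx_df.drop(columns=['comments', 'w_rad', 'Unnamed: 0']).set_index('time')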
wx_df.head()

time,T,w_dir,w_spd,w_gust,rh,precip,station_id,latitude,longitude,course_id,w_u,w_v
2019-01-12 00:04:44,51.0,68.0,0.0,0.0,97.0,0.0,KCABERKE85,,,2,0.0,0.0
2019-01-12 00:09:58,51.0,68.0,0.0,0.0,97.0,0.0,KCABERKE85,,,2,0.0,0.0
2019-01-12 00:14:54,51.0,270.0,0.0,0.0,97.0,0.0,KCABERKE85,,,2,-0.0,0.0
2019-01-12 00:19:50,50.0,270.0,0.0,0.0,97.0,0.0,KCABERKE85,,,2,-0.0,0.0
2019-01-12 00:24:46,51.0,270.0,0.0,0.0,97.0,0.0,KCABERKE85,,,2,-0.0,0.0
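
As a sanity check on the `w_u` / `w_v` conversion, the sketch below applies the same `90 - w_dir` transformation to a few made-up bearings and then recovers speed and bearing from the components. The sample values are invented for illustration. The components point along the reported bearing, since no sign flip is applied.

```python
import numpy as np

# Made-up sample observations: bearing in degrees, speed in the station's units
w_dir = np.array([45.0, 135.0, 225.0, 315.0])
w_spd = np.array([5.0, 2.0, 8.0, 1.0])

# Same convention as the notebook: compass bearing -> math angle -> components
theta = np.deg2rad(90.0 - w_dir)
w_u = w_spd * np.cos(theta)  # east-west component
w_v = w_spd * np.sin(theta)  # north-south component

# Invert the transformation to recover speed and bearing
spd_back = np.hypot(w_u, w_v)
dir_back = (90.0 - np.rad2deg(np.arctan2(w_v, w_u))) % 360.0

print(np.round(w_u, 3), np.round(w_v, 3))
print(np.round(spd_back, 3), np.round(dir_back, 3))
```

Working with components rather than bearings is what makes later averaging well behaved, because bearings wrap around at 360 degrees while components do not.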


In [5]:
# Lastly, get the course timing data 
with open('../geo/course_timings.json', 'r') as timings_file:
    course_timings = json.load(timings_file)
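
The contents of `course_timings.json` are not shown in this notebook. Judging only from the keys that `compute_weight_by_course_and_time_of_day` reads, and from the start windows listed in its comments, a file shaped roughly like the following would be accepted. The exact layout and values of the real file are an assumption here; times are minutes past midnight.

```python
import json

# Hypothetical contents: key names come from the lookup code in this notebook,
# times come from the course notes in its comments (minutes past midnight).
example_timings = {
    "courses": [
        {
            "course_id": 0,  # Golden Gate Park, rolling start 7:30 - 10:00 AM
            "dst_matters": False,
            "start_times": {"type": "rolling", "rolling_earliest": 450, "rolling_latest": 600},
            "end_times": {"type": "fixed_length", "fixed_length": 180},
        },
        {
            "course_id": 2,  # Aquatic Park, window depends on Daylight Saving Time
            "dst_matters": True,
            "start_times_dst": {"type": "rolling", "rolling_earliest": 930, "rolling_latest": 1080},
            "end_times_dst": {"type": "fixed_length", "fixed_length": 180},
            "start_times_nondst": {"type": "rolling", "rolling_earliest": 540, "rolling_latest": 630},
            "end_times_nondst": {"type": "fixed_length", "fixed_length": 180},
        },
    ]
}

# Round-trip through JSON to confirm the structure is serializable
round_tripped = json.loads(json.dumps(example_timings))
print(json.dumps(round_tripped["courses"][0], indent=2))
```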

### Process Data

In [6]:
max_course_id = wx_df.course_id.max()
max_course_id

2

In [57]:
def compute_weighting_factor(t : float, ramp_start: float, ramp_end: float, game_length: float) -> float:
    """Computes the player weighting factor at time t given that players start playing games of length 'game_length'
    in either a rolling fashion between 'ramp_start' and 'ramp_end', or if 'ramp_start' == 'ramp_end', there is a fixed
    starting point.  Time units must be consistent, the expected form is a number representing minutes past midnight.  
    Returns a weighting factor between 0 and 1.  See the function 'compute_weight_by_course_and_time_of_day' for 
    additional info.  Limited error handling is implemented."""
    
    weight = 0.0  #Default case for when no one is playing 
    if game_length == 0:  #Re-set to default to avoid division by zero later
        game_length = 180
    if ramp_start > ramp_end:   #Fix to avoid weights outside of [0, 1]
        ramp_start, ramp_end = ramp_end, ramp_start
    if ramp_start == ramp_end:   #Fixed starting point case
        if ramp_start <= t <= (ramp_end + game_length):
            weight = 1.0
    else:  #Rolling case
        if (ramp_end - ramp_start) > game_length:  #Some players finish before others begin
            if ramp_start <= t < (ramp_start + game_length):
                weight = (t - ramp_start) / game_length
            elif (ramp_start + game_length) <= t <= ramp_end:  #Plateau lasts until the last players start
                weight = 1.0
            elif ramp_end < t <= (ramp_end + game_length):
                weight = 1.0 - (t - ramp_end) / game_length
        else: #All players begin before first player ends
            if ramp_start <= t < ramp_end:
                weight = (t - ramp_start) / (ramp_end - ramp_start)
            elif ramp_end <= t < (ramp_start + game_length):
                weight = 1.0
            elif (ramp_start + game_length) <= t < (ramp_end + game_length):
                weight = 1.0 - (t - ramp_start - game_length) / (ramp_end - ramp_start)
    
    return weight
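
To make the rolling-start case concrete, take a 7 AM to 9 AM start window with 3-hour games. The weight should be 0 at 7 AM, reach 1 at 9 AM, hold at 1 until 10 AM, and fall back to 0 at 12 PM. The sketch below reproduces that trapezoid with `np.interp`. `trapezoid_weight` is a hypothetical name used only for this illustration, and it assumes `ramp_start < ramp_end` and a window length different from the game length, so that the four breakpoints are distinct.

```python
import numpy as np


def trapezoid_weight(t, ramp_start, ramp_end, game_length):
    """Hypothetical np.interp version of the rolling-start weighting.

    All times are minutes past midnight.
    """
    # The ramp up ends when either the last player starts or the first player finishes
    rise_end = min(ramp_end, ramp_start + game_length)
    # The ramp down begins at the later of those two events
    fall_start = max(ramp_end, ramp_start + game_length)
    breakpoints = [ramp_start, rise_end, fall_start, ramp_end + game_length]
    return float(np.interp(t, breakpoints, [0.0, 1.0, 1.0, 0.0], left=0.0, right=0.0))


# 7 AM = 420, 9 AM = 540, 3-hour game = 180 minutes
for minute in (400, 420, 480, 540, 600, 660, 720, 800):
    print(minute, trapezoid_weight(minute, 420, 540, 180))
```

A single interpolation over four breakpoints avoids the chain of interval tests, at the cost of not handling the fixed-start case, which would still need its own branch.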

In [58]:
def compute_weight_by_course_and_time_of_day(course_id: int, mins_since_midnight: float, dst: bool) -> float:
    """Given a course_id and time of day (in minutes since midnight), plus a boolean indicator (dst) of whether 
    Daylight Saving Time is in effect, compute a weight [0 to 1] to assign to resampled weather
    observations centered on this time of day.  This function expects that a module-level dictionary named
    course_timings has already been created by importing the course_timings.json file."""
    
    #Individual course data
    # Course_id 0 (Golden Gate Park) -- rolling start 7:30 - 10 AM on Sundays, 3 hour game
    # Course_id 1 (Chabot Lake) -- rolling start 8:00 AM - 2:00 PM, 3 hour game
    # Course_id 2 (Aquatic Park) -- rolling start 3:30 - 6 PM, 3 hour game during Daylight Savings Time
    # Course_id 2 (Aquatic Park) -- rolling start 9:00 - 10:30 AM, 3 hour game otherwise
    # The model assumes a uniform distribution of player start times within the rolling window. It puts 0 
    # outside of the window, and 1 for the maximum number of people playing.  For example, if there is a rolling
    # start at 7 AM - 9 AM and each game is 3 hours long, then times before 7 AM are weighted 0, the weight ramps
    # up to 1 by 9 AM, stays at 1 until 10 AM (when players who started at 7 AM are done), and then ramps down to 0
    # by 12 PM, when the players who started last complete their game.  All ramps are linear. 
    
    if not course_timings:
        raise IOError('Missing course timing data.  Please update the notebook to import "course_timings.json"')
    
    ramp_up_start = 360   #Set some default values in case the course_id is not found
    ramp_up_end = 900
    game_length = 180
    # Now look for a course that matches 'course_id' in the course timings, and if one is found, replace the defaults
    for course in course_timings['courses']:
        if course.get('course_id',0) == course_id:
            start_times_key = 'start_times'  #Default keys for when Daylight Savings Time does not matter
            end_times_key = 'end_times'
            if course.get('dst_matters',False):
            # dst_matters == True means the rules differ based on whether Daylight Savings Time is in effect
                if dst:
                    start_times_key = 'start_times_dst'
                    end_times_key = 'end_times_dst'
                else: 
                    start_times_key = 'start_times_nondst'
                    end_times_key = 'end_times_nondst'
            if course[start_times_key].get('type','fixed') == 'rolling':
                ramp_up_start = course[start_times_key].get('rolling_earliest',360)
                ramp_up_end = course[start_times_key].get('rolling_latest',900)
            else:
                ramp_up_start = course[start_times_key].get('start_time',360)
                ramp_up_end = ramp_up_start
            game_length_type = course[end_times_key].get('type','fixed_length')
            if game_length_type == 'fixed_length':
                game_length = course[end_times_key].get('fixed_length',180)
            else:
                game_length = 180  #Nothing besides fixed_length implemented 
            break  #No need to continue searching if the course info has been found
    
    #Executes on completion of for loop, either by break or because course not found
    weighting_factor = compute_weighting_factor(mins_since_midnight, ramp_up_start, ramp_up_end, game_length)
    return weighting_factor 
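
The `dst` argument has to be worked out for each game date. One way to do that, sketched here under the assumption that all courses are in the US Pacific time zone, is to localize a time on that date with `pytz` and check whether its `dst()` offset is non-zero. `is_dst` is a hypothetical helper name.

```python
import datetime as dt

import pytz


def is_dst(date_of_game, tz_name="US/Pacific"):
    """Return True if Daylight Saving Time is in effect at 5 AM local time on the given date.

    Hypothetical helper; 5 AM is chosen so the check falls after the 2 AM changeover.
    """
    tz = pytz.timezone(tz_name)
    local_time = tz.localize(dt.datetime.combine(date_of_game, dt.time(5, 0)))
    return local_time.dst() != dt.timedelta(0)


print(is_dst(dt.date(2019, 1, 12)))
print(is_dst(dt.date(2019, 3, 31)))
```

Checking `dst()` directly avoids having to reason about how a negative UTC offset is stored in a `timedelta`.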

In [92]:
weighted_data = pd.DataFrame()
value_cols = ['T','w_dir','w_spd','w_gust','rh','precip','w_u','w_v']
for course_index in range(max_course_id + 1):
    course_df = wx_df[wx_df.course_id == course_index]
    dates_list = [item.date() for item in course_df.index]
    unique_days = list(set(dates_list))
    for date_of_game in unique_days:
        course_by_date_df = course_df.loc[date_of_game.isoformat()]
        resampled_df = course_by_date_df[value_cols].resample('15T').mean()
        resampled_df = resampled_df.reset_index()
        # Use the midpoint of each 15-minute bin as its time of day
        resampled_df['minute_of_day'] = resampled_df['time'].dt.hour * 60 + resampled_df['time'].dt.minute + 7.5
        # Check whether Daylight Saving Time is in effect because this changes the way time of day is weighted.
        # Localize 5 AM on the game date to US Pacific; a non-zero dst() offset means DST is in effect.
        dst_test_time = dt.datetime.combine(date_of_game, dt.time(5, 0))
        dst_in_effect = pytz.timezone('US/Pacific').localize(dst_test_time).dst() != dt.timedelta(0)
        resampled_df['weight_factor'] = resampled_df['minute_of_day'].apply(
            lambda x: compute_weight_by_course_and_time_of_day(course_index, x, dst_in_effect))
        for col in value_cols:
            resampled_df['weighted_' + col] = resampled_df['weight_factor'] * resampled_df[col]
        total_weight = resampled_df['weight_factor'].sum()
        if total_weight > 0:
            # Sum of (weight * value) divided by the sum of weights gives the weighted daily mean
            daily = resampled_df.drop(columns='time').sum() / total_weight
            daily['course_id'] = course_index  #Set after dividing so the id itself is not rescaled
            # Note: columns are keyed by date only, so a date shared by two courses keeps only the last course
            weighted_data[date_of_game] = daily

weighted_data = weighted_data.transpose()
weighted_data.head()

Unnamed: 0,T,w_dir,w_spd,w_gust,rh,precip,w_u,w_v,minute_of_day,weight_factor,weighted_T,weighted_w_dir,weighted_w_spd,weighted_w_gust,weighted_rh,weighted_precip,weighted_w_u,weighted_w_v,course_id
2019-02-10,368.30303,2053.336279,22.386574,51.298401,599.650463,0.03875,-19.266073,-4.392252,5760.0,1.0,46.014583,266.426768,2.263763,6.233681,70.02279,0.0,-2.015003,-0.454979,0.0
2019-03-03,419.677564,1818.695493,16.879946,38.669117,753.48049,0.030201,-13.271825,-8.924041,5760.0,1.0,53.790067,234.627308,2.51751,5.416747,96.966598,0.01434,-2.007448,-1.280875,0.0
2019-03-10,379.716004,815.540404,20.949815,46.134603,644.604966,0.08515,14.480887,-2.880723,5760.0,1.0,51.286667,94.25117,5.2435,9.159674,81.224842,5.7e-05,4.527552,-0.706237,0.0
2019-01-27,456.744213,948.302083,15.555556,36.237269,582.260417,0.0,9.500594,8.983949,5760.0,1.0,56.514352,66.074074,3.919907,7.743519,74.041204,0.0,2.704494,2.301259,0.0
2019-02-24,408.646086,1610.301473,9.14436,31.403072,626.197517,0.0,-3.931402,-5.200154,5760.0,1.0,52.775316,178.127525,1.955492,5.484659,75.957292,0.0,-0.245511,-1.23159,0.0
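
Each `weighted_*` value in the table is a weighted mean: the sum of weight times value over the day's 15-minute bins, divided by the sum of the weights. The sketch below shows that calculation on four made-up bins using `np.average`. It also shows why a bearing derived from the weighted `w_u` and `w_v` components can differ sharply from a direct average of degrees, which is how `weighted_w_dir` is computed in this notebook. The derived bearing is an alternative offered for comparison, not the notebook's method.

```python
import numpy as np
import pandas as pd

# Made-up 15-minute bins: wind swinging either side of north
bins = pd.DataFrame({
    "T": [50.0, 54.0, 58.0, 60.0],
    "w_dir": [350.0, 10.0, 350.0, 10.0],
    "w_spd": [4.0, 4.0, 4.0, 4.0],
    "weight_factor": [0.0, 0.5, 1.0, 0.5],
})
theta = np.deg2rad(90.0 - bins["w_dir"])
bins["w_u"] = bins["w_spd"] * np.cos(theta)
bins["w_v"] = bins["w_spd"] * np.sin(theta)

w = bins["weight_factor"]
weighted_T = np.average(bins["T"], weights=w)      # sum(w * T) / sum(w)
naive_dir = np.average(bins["w_dir"], weights=w)   # averages degrees directly
mean_u = np.average(bins["w_u"], weights=w)
mean_v = np.average(bins["w_v"], weights=w)
# Bearing of the weighted mean wind vector, wrapped into [0, 360)
vector_dir = (90.0 - np.rad2deg(np.arctan2(mean_v, mean_u))) % 360.0

print(weighted_T, naive_dir, round(vector_dir, 3))
```

In this example the direct average of degrees lands near 180, pointing opposite to every observation, while the vector-based bearing stays near north. Downstream models may therefore be better served by `weighted_w_u` and `weighted_w_v` than by `weighted_w_dir`.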


In [93]:
#Prepare and export df
weighted_data = weighted_data.drop(columns = ['T','w_dir','w_gust','rh','w_u','w_v','minute_of_day','weight_factor'])
weighted_data.to_csv('../wx_model_data/wx_model.csv')
weighted_data.head()

Unnamed: 0,w_spd,precip,weighted_T,weighted_w_dir,weighted_w_spd,weighted_w_gust,weighted_rh,weighted_precip,weighted_w_u,weighted_w_v,course_id
2019-02-10,22.386574,0.03875,46.014583,266.426768,2.263763,6.233681,70.02279,0.0,-2.015003,-0.454979,0.0
2019-03-03,16.879946,0.030201,53.790067,234.627308,2.51751,5.416747,96.966598,0.01434,-2.007448,-1.280875,0.0
2019-03-10,20.949815,0.08515,51.286667,94.25117,5.2435,9.159674,81.224842,5.7e-05,4.527552,-0.706237,0.0
2019-01-27,15.555556,0.0,56.514352,66.074074,3.919907,7.743519,74.041204,0.0,2.704494,2.301259,0.0
2019-02-24,9.14436,0.0,52.775316,178.127525,1.955492,5.484659,75.957292,0.0,-0.245511,-1.23159,0.0

