# Testing Weather Data

## Trevor Rowland - 3/25/25

This notebook tests the `RegionWeather` class defined in `regionweather.py`, investigates how the data should be cleaned, and starts work on forecasting methods for this data.

## 1. Importing Packages and Data

Here we will import required packages, the RegionWeather class, and the dictionary of regions to use.

In [17]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import datetime as dt
from backend.regionweather import RegionWeather

In [18]:
region_data = {
    'US-FLA-FMPP': {'lat': 28.525581, 'lon': -81.536775, 'alt': 0},
    'US-FLA-FPC': {'lat': 28.996695, 'lon': -82.886613, 'alt': 0},
    'US-FLA-FPL': {'lat': 27.917488, 'lon': -81.450970, 'alt': 0},
    'US-FLA-GVL': {'lat': 29.619310, 'lon': -82.328732, 'alt': 0},
    'US-FLA-HST': {'lat': 25.456904, 'lon': -80.588092, 'alt': 0},
    'US-FLA-JEA': {'lat': 30.390902, 'lon': -83.679837, 'alt': 0},
    'US-FLA-SEC': {'lat': 28.805983, 'lon': -82.306291, 'alt': 0},
    'US-FLA-TAL': {'lat': 30.437174, 'lon': -84.248042, 'alt': 0},
    'US-FLA-TEC': {'lat': 27.959413, 'lon': -82.144821, 'alt': 0}
}
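
As a quick sanity check on the region dictionary, the entries can be loaded into a DataFrame and the coordinates compared against valid ranges. This is only a minimal sketch: the two-entry `sample_regions` dict and the `regions_df` name are illustrative and not part of the notebook's pipeline.

```python
import pandas as pd

# Two entries copied from region_data, so the block runs on its own.
sample_regions = {
    'US-FLA-FMPP': {'lat': 28.525581, 'lon': -81.536775, 'alt': 0},
    'US-FLA-TAL': {'lat': 30.437174, 'lon': -84.248042, 'alt': 0},
}

# One row per region, one column per field.
regions_df = pd.DataFrame.from_dict(sample_regions, orient='index')

# Latitude must lie in [-90, 90] and longitude in [-180, 180].
valid = regions_df['lat'].between(-90, 90) & regions_df['lon'].between(-180, 180)
print(regions_df[~valid])  # empty frame when every coordinate is in range
```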

## 2. Testing RegionWeather Functions

Now let's test to make sure the initial implementation of RegionWeather works.

In [19]:
region_name = 'US-FLA-FMPP'
region = region_data[region_name]
lat = region['lat']
lon = region['lon']
alt = region['alt']
end = dt.datetime.today()
start = end - dt.timedelta(days=365) # one year of data

rw = RegionWeather(region_name, lat, lon, alt, start, end)

fetching Hourly Object...
Hourly Object Fetched!
Fetching Hourly Data from Object...
Data:
                     temp  dwpt  rhum  prcp  snow   wdir  wspd  wpgt    pres  \
time                                                                           
2024-03-25 15:00:00  24.1  12.4  48.0   0.0   NaN   80.0  29.5   NaN  1018.3   
2024-03-25 16:00:00  24.1  12.4  48.0   0.0   NaN   80.0  29.5   NaN  1018.3   
2024-03-25 17:00:00  25.8  13.7  47.0   0.0   NaN  110.0  25.9   NaN  1017.4   
2024-03-25 18:00:00  25.2  14.1  50.0   0.0   NaN  130.0  24.1   NaN  1017.0   
2024-03-25 19:00:00  24.6  14.1  52.0   0.0   NaN  120.0  25.9   NaN  1016.6   

                     tsun  coco  
time                             
2024-03-25 15:00:00   NaN   2.0  
2024-03-25 16:00:00   NaN   3.0  
2024-03-25 17:00:00   NaN   3.0  
2024-03-25 18:00:00   NaN   3.0  
2024-03-25 19:00:00   NaN   3.0  
Columns: ['temp', 'dwpt', 'rhum', 'prcp', 'snow', 'wdir', 'wspd', 'wpgt', 'pres', 'tsun', 'coco']
Hourly Objec

## 3. RegionWeather EDA

Now that all of the data has been fetched, aggregated, and interpolated, we can call the `to_dict()` function from `RegionWeather` to view the data and perform a quick EDA.

In [20]:
d = rw.to_dict()
hourly = d['df_hourly']
daily = d['df_daily']
weekly = d['df_weekly']
monthly = d['df_monthly']
fifteen_m = d['df_15m']

In [21]:
hourly.head(10)

time,temp,dwpt,rhum,prcp,snow,wdir,wspd,wpgt,pres,tsun,coco
2024-03-25 15:00:00,24.1,12.4,48.0,0.0,,80.0,29.5,,1018.3,,2.0
2024-03-25 16:00:00,24.1,12.4,48.0,0.0,,80.0,29.5,,1018.3,,3.0
2024-03-25 17:00:00,25.8,13.7,47.0,0.0,,110.0,25.9,,1017.4,,3.0
2024-03-25 18:00:00,25.2,14.1,50.0,0.0,,130.0,24.1,,1017.0,,3.0
2024-03-25 19:00:00,24.6,14.1,52.0,0.0,,120.0,25.9,,1016.6,,3.0
2024-03-25 20:00:00,24.6,14.1,52.0,0.0,,120.0,25.9,,1016.6,,3.0
2024-03-25 21:00:00,23.5,14.5,57.0,0.0,,80.0,29.5,,1016.6,,3.0
2024-03-25 22:00:00,22.4,14.0,59.0,0.0,,90.0,29.5,,1017.0,,3.0
2024-03-25 23:00:00,22.4,14.0,59.0,0.0,,90.0,27.7,,1017.0,,3.0
2024-03-26 00:00:00,21.3,14.7,66.0,0.0,,80.0,14.8,,1017.7,,1.0


In [22]:
daily.head(10)

time,tavg,tmin,tmax,prcp,snow,wdir,wspd,wpgt,pres,tsun
1990-02-09,22.2,17.4,27.4,,,178.0,13.7,,,
1990-02-10,22.6,19.6,28.5,,,222.0,18.7,,,
1990-02-11,20.1,15.8,23.5,,,284.0,17.6,,,
1990-02-12,16.7,11.9,22.4,,,32.0,14.1,,,
1990-02-13,17.7,11.3,25.2,,,,10.7,,,
1990-02-14,19.7,14.6,25.8,,,142.0,14.0,,,
1990-02-15,22.9,18.0,29.1,,,162.0,15.4,,,
1990-02-16,24.2,18.5,28.5,,,199.0,16.2,,,
1990-02-17,23.8,20.2,29.1,,,228.0,12.5,,,
1990-02-18,23.1,19.1,28.5,,,119.0,13.9,,,


In [23]:
weekly.head(10)

time,temp,dwpt,rhum,prcp,snow,wdir,wspd,wpgt,pres,tsun,coco
1990-02-11,22.0,18.1,79.2,0.0,,197.9,16.1,,,0.0,
1990-02-18,21.0,16.6,71.7,0.0,,148.4,13.7,,,0.0,
1990-02-25,19.3,14.7,77.5,0.0,,81.0,16.2,,,0.0,
1990-03-04,17.7,12.0,72.2,0.0,,41.1,14.8,,,0.0,
1990-03-11,20.3,14.1,70.2,0.0,,75.8,14.1,,,0.0,
1990-03-18,20.6,16.2,78.7,0.0,,129.8,15.9,,,0.0,
1990-03-25,19.5,11.6,63.6,0.0,,51.1,12.2,,,0.0,
1990-04-01,22.2,17.7,78.1,0.0,,142.3,11.3,,,0.0,
1990-04-08,20.0,13.5,66.3,0.0,,298.8,13.5,,,0.0,
1990-04-15,20.8,17.1,79.4,0.0,,69.9,15.1,,,0.0,


In [24]:
monthly.head(10)

time,tavg,tmin,tmax,prcp,wspd,pres,tsun
2005-01-01,16.8,12.0,22.1,,12.5,,
2005-02-01,17.5,13.1,22.8,,13.4,,
2005-03-01,,,,,,,
2005-04-01,,,,,,,
2005-05-01,24.3,20.1,29.6,,10.4,,
2005-06-01,26.3,23.6,30.5,,10.2,,
2005-07-01,,,,,,,
2005-08-01,28.6,25.4,33.6,,7.7,,
2005-09-01,27.1,24.0,31.5,,11.2,,
2005-10-01,,,,,,,


In [25]:
fifteen_m.head(10)

time,temp,dwpt,rhum,prcp,snow,wdir,wspd,wpgt,pres,tsun,coco
2024-03-25 15:00:00,24.1,12.4,48.0,0.0,,80.0,29.5,,1018.3,,2.0
2024-03-25 15:15:00,24.1,12.4,48.0,0.0,,80.0,29.5,,1018.3,,2.25
2024-03-25 15:30:00,24.1,12.4,48.0,0.0,,80.0,29.5,,1018.3,,2.5
2024-03-25 15:45:00,24.1,12.4,48.0,0.0,,80.0,29.5,,1018.3,,2.75
2024-03-25 16:00:00,24.1,12.4,48.0,0.0,,80.0,29.5,,1018.3,,3.0
2024-03-25 16:15:00,24.525,12.725,47.75,0.0,,87.5,28.6,,1018.075,,3.0
2024-03-25 16:30:00,24.95,13.05,47.5,0.0,,95.0,27.7,,1017.85,,3.0
2024-03-25 16:45:00,25.375,13.375,47.25,0.0,,102.5,26.8,,1017.625,,3.0
2024-03-25 17:00:00,25.8,13.7,47.0,0.0,,110.0,25.9,,1017.4,,3.0
2024-03-25 17:15:00,25.65,13.8,47.75,0.0,,115.0,25.45,,1017.3,,3.0
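
The 15-minute values are consistent with linear interpolation between consecutive hourly readings (for example, `temp` steps from 24.1 to 25.8 in equal increments). The sketch below reproduces that pattern with plain pandas. It is an assumption about how the upsampling could be done, not a copy of what `RegionWeather` does internally, and the names `hourly_sample` and `fifteen_sample` are illustrative.

```python
import pandas as pd

# Two hourly readings taken from the hourly table (16:00 and 17:00).
hourly_sample = pd.DataFrame(
    {'temp': [24.1, 25.8], 'wdir': [80.0, 110.0]},
    index=pd.to_datetime(['2024-03-25 16:00', '2024-03-25 17:00']),
)

# Upsample to a 15-minute grid (new rows are NaN), then fill them linearly.
fifteen_sample = (
    hourly_sample.resample('15min').asfreq().interpolate(method='linear')
)
print(fifteen_sample)
```

Linear interpolation of this kind applies to every numeric column alike, which would explain why `coco` takes fractional values such as 2.25 in the 15-minute table. Whether that is meaningful depends on what the column encodes.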


For this region, US-FLA-FMPP, we see many `NaN` values in the `snow`, `wpgt`, and `tsun` columns. Let's look at the finest resolution from the API (hourly) to decide how to handle the missing data.

In [26]:
snow = hourly['snow']

# isna().sum() counts the missing values; isna().count() would count every row
n_missing = snow.isna().sum()
print(f'There are {n_missing} missing values.')
print(f'There are {len(snow)} total values in the column.')
print(f'The percent missing is: {n_missing / len(snow) * 100}%')

There are 8760 missing values.
There are 8760 total values in the column.
The percent missing is: 100.0%


From this we see that 100% of the snow data is unavailable. This could mean that the station does not measure snow, a measurement tool was damaged during this time, or that there was no snow. **How can we approach this programmatically in the pipeline?**
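
One possible way to approach this programmatically is to compute the missing share of every column at once instead of inspecting columns one by one. The sketch below is a hedged illustration: `missing_report` is a hypothetical helper, and the small `demo` frame stands in for `hourly`.

```python
import numpy as np
import pandas as pd

def missing_report(df: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values for each column."""
    n_missing = df.isna().sum()
    report = pd.DataFrame({
        'n_missing': n_missing,
        'pct_missing': n_missing / len(df) * 100,
    })
    # Most incomplete columns first.
    return report.sort_values('pct_missing', ascending=False)

# Small synthetic frame standing in for `hourly`.
demo = pd.DataFrame({
    'temp': [24.1, 25.8, 25.2, 24.6],
    'prcp': [0.0, np.nan, 0.0, 0.0],
    'snow': [np.nan] * 4,
})
report = missing_report(demo)
print(report)
```

In the notebook this could be called as `missing_report(hourly)`, and the pipeline could then flag any column whose `pct_missing` exceeds a chosen threshold.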

Now let's examine the other columns with missing values, `tsun` and `wpgt`.

In [28]:
tsun = hourly['tsun']

# isna().sum() counts the missing values; isna().count() would count every row
n_missing = tsun.isna().sum()
print(f'There are {n_missing} missing values.')
print(f'There are {len(tsun)} total values in the column.')
print(f'The percent missing is: {n_missing / len(tsun) * 100}%')

There are 8760 missing values.
There are 8760 total values in the column.
The percent missing is: 100.0%


In [29]:
wpgt = hourly['wpgt']

# isna().sum() counts the missing values; isna().count() would count every row
n_missing = wpgt.isna().sum()
print(f'There are {n_missing} missing values.')
print(f'There are {len(wpgt)} total values in the column.')
print(f'The percent missing is: {n_missing / len(wpgt) * 100}%')

There are 8760 missing values.
There are 8760 total values in the column.
The percent missing is: 100.0%


Now that we have examined a year's worth of data, we can see that **there are no measurements for the `snow`, `wpgt`, or `tsun` columns**. This means that we will need to use a fallback data source, or omit the columns from the dataset.
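
Both options can be combined in a single cleaning step: fill gaps from a fallback source where one exists, then drop any column that is still completely empty. The following is a minimal sketch under stated assumptions. `fill_or_drop` is a hypothetical helper, the `fallback` frame is invented for illustration (no fallback source has been chosen yet), and both frames are assumed to share the same time index.

```python
import numpy as np
import pandas as pd

def fill_or_drop(df, fallback=None):
    """Fill gaps from `fallback` if given, then drop columns that are still all-NaN."""
    if fallback is not None:
        # fillna aligns on index and column labels, so only matching cells are filled.
        df = df.fillna(fallback)
    return df.dropna(axis=1, how='all')

idx = pd.date_range('2024-03-25 15:00', periods=3, freq='60min')
primary = pd.DataFrame(
    {'temp': [24.1, 25.8, 25.2], 'wpgt': [np.nan] * 3, 'snow': [np.nan] * 3},
    index=idx,
)
# Hypothetical fallback source that reports wind gusts but not snow.
fallback = pd.DataFrame({'wpgt': [40.0, 38.5, 37.0]}, index=idx)

cleaned = fill_or_drop(primary, fallback)   # wpgt filled, snow dropped
dropped_only = fill_or_drop(primary)        # no fallback: wpgt and snow dropped
print(cleaned)
print(dropped_only)
```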