### Goal:
Clean wind data:
- get rid of unnecessary columns
- convert time to datetime object, then convert from PST to UTC
- keep only those data collected near the top of the hour
- fill NaN values with average of two adjacent hour measurements when possible

In [1]:
import pandas as pd
from pathlib import Path
from datetime import timedelta

In [2]:
root_folder = Path.cwd().parents[1]


In [3]:
df = pd.read_csv(root_folder/"data/raw/raw_wind.csv", usecols=['DATE','HourlyWindDirection','HourlyWindSpeed'])

#converting PST to UTC
df['UTC'] = pd.to_datetime(df['DATE'],utc=True)+timedelta(hours=8)

#shifting by 7 minutes so that observations taken at :53 land on the top of the hour
df['UTC'] = df['UTC']+timedelta(minutes=7)
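
A quick sanity check of the two shifts above, as a minimal sketch on made-up timestamps (the `toy` frame and its values are hypothetical, not taken from `raw_wind.csv`). Like the cell above, it assumes `DATE` holds naive local standard time at a fixed 8-hour offset from UTC:

```python
import pandas as pd
from datetime import timedelta

# Hypothetical timestamps in the same naive local-time form as the DATE column
toy = pd.DataFrame({"DATE": ["2015-01-01T00:53:00",
                             "2015-01-01T01:10:00",
                             "2015-01-01T23:53:00"]})

# Same two steps as the notebook: fixed +8 h offset, then +7 min
toy["UTC"] = pd.to_datetime(toy["DATE"], utc=True) + timedelta(hours=8)
toy["UTC"] = toy["UTC"] + timedelta(minutes=7)

# True where the shifted timestamp sits exactly on the hour
on_hour = toy["UTC"].dt.minute == 0
print(toy.assign(on_hour=on_hour))
```

Only the two :53 rows end up exactly on the hour; the 01:10 row lands at :17 and would be dropped by a top-of-the-hour filter.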

After some exploration, I found that of the data taken at irregular times (i.e. not 53 minutes into the hour), almost 20% are NaN values (almost 2000 entries). On the other hand, while the data taken at the regular hourly interval is 37 hours short of being complete, only 129 of its entries are NaN. So I decided to keep only the top-of-the-hour measurements and fill in the NaN entries with the average of the two adjacent values when possible, which leaves 45 NaN values for wind direction and 11 NaN values for wind speed. I chose this over squinting through the irregular measurements looking for complete entries to supplement the 37 missing hours and the NaN values. I could track down those 37 missing hours and look for data among the irregular measurements, but in the scheme of 5 years, 37 missing hours, even if all were daylight, only make up 0.16% of the total entries I'll look at.
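
The kind of comparison described above can be sketched as follows. This is a minimal illustration on a hypothetical five-row sample (the `toy` frame and the `summary` table are made-up names and values), not the exploration that was actually run:

```python
import numpy as np
import pandas as pd

# Hypothetical mini-sample; the real notebook uses df built from raw_wind.csv
toy = pd.DataFrame({
    "UTC": pd.to_datetime(
        ["2015-01-01 09:00", "2015-01-01 09:17", "2015-01-01 10:00",
         "2015-01-01 10:32", "2015-01-01 11:00"], utc=True),
    "HourlyWindSpeed": [5.0, np.nan, np.nan, 3.0, 7.0],
})

on_hour = toy["UTC"].dt.minute == 0

# NaN flags for wind speed within each group
nan_on_hour = toy.loc[on_hour, "HourlyWindSpeed"].isna()
nan_off_hour = toy.loc[~on_hour, "HourlyWindSpeed"].isna()

# Row count, NaN count, and NaN share for each group
summary = pd.DataFrame({
    "rows": [on_hour.sum(), (~on_hour).sum()],
    "nan_count": [nan_on_hour.sum(), nan_off_hour.sum()],
    "nan_share": [nan_on_hour.mean(), nan_off_hour.mean()],
}, index=["top of hour", "other times"])
print(summary)
```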

In [4]:
#only keeping top of the hour observations
top = df[df['UTC'].dt.minute == 0].reset_index()

### Wind Speed

In [5]:
#getting list of NaN indices so I can create an average to fill them
naspeed = top[top['HourlyWindSpeed'].isna()].index.tolist()

#removing the last index: it is a NaN, nighttime value and has no following neighbor for the loop below
naspeed.remove(top.index[-1])

#working on an explicit copy avoids pandas' SettingWithCopyWarning
speed = top['HourlyWindSpeed'].copy()
#filling in average of neighbors (stays NaN if either neighbor is NaN)
for i in naspeed:
    speed.loc[i] = (speed.loc[i-1] + speed.loc[i+1]) / 2

top['Wind Speed'] = speed
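
The neighbor-average fill above can also be written without a loop. The following is a minimal sketch on a hypothetical series, using `Series.shift` and `Series.fillna`. It should behave like the loop, since a gap whose neighbor is also missing stays NaN, and it needs no special handling for the first or last row:

```python
import numpy as np
import pandas as pd

# Hypothetical wind speeds with single gaps, a double gap, and a gap at the end
speed = pd.Series([5.0, np.nan, 9.0, np.nan, np.nan, 4.0, np.nan])

# Average of the previous and next observation; NaN if either one is missing
neighbor_mean = (speed.shift(1) + speed.shift(-1)) / 2

# Only the NaN entries are replaced; existing values are left untouched
filled = speed.fillna(neighbor_mean)
print(filled)
```

In this toy series only the gap at position 1 has two valid neighbors, so it is the only one that gets filled.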


### Wind Direction

Wind direction entries are strings because of the measurements ```VRB``` and ```000```, which mean 'variable' and 'calm' respectively. To compute an average I need integers, so I turn anything that's not 'VRB', '000', or NaN into ```int```.
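
One caveat with averaging directions: compass bearings wrap around at 360, so the plain arithmetic mean of 350 and 10 is 180 (due south) even though both readings point roughly north. The sketch below is not what this notebook does. It is a possible alternative, using a hypothetical helper `circular_mean_deg`, that averages two bearings through their unit vectors:

```python
import math

def circular_mean_deg(a, b):
    """Mean of two compass bearings in degrees, taken the short way around.

    Hypothetical helper for illustration. The result is not meaningful
    when the two bearings are exactly opposite (e.g. 0 and 180).
    """
    x = math.cos(math.radians(a)) + math.cos(math.radians(b))
    y = math.sin(math.radians(a)) + math.sin(math.radians(b))
    # Round away floating-point noise before wrapping into [0, 360)
    return round(math.degrees(math.atan2(y, x)), 6) % 360

print((350 + 10) / 2)              # plain arithmetic mean
print(circular_mean_deg(350, 10))  # circular mean of the same two bearings
print(circular_mean_deg(80, 100))  # agrees with the plain mean away from the wrap
```

Whether the difference matters depends on how often the gaps being filled sit next to near-north readings.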

In [6]:
#working on an explicit copy avoids pandas' SettingWithCopyWarning
direction = top['HourlyWindDirection'].copy()

#getting the indices of variable ('VRB') and calm ('000') winds
calm_winds = direction[direction.isin(['VRB', '000'])].index.to_list()

#getting NaN indices
nadirection = direction[direction.isna()].index.to_list()

#making integer values
int_direction = direction.drop(index=calm_winds+nadirection).astype(int)

#removing the last index: it is a NaN, nighttime value and has no following neighbor for the loop below
nadirection.remove(top.index[-1])

#replacing string numbers with integers
direction.loc[int_direction.index] = int_direction

#filling NaN values with the average, only when both neighbors are integers
for i in nadirection:
    if isinstance(direction.loc[i-1], int) and isinstance(direction.loc[i+1], int):
        direction.loc[i] = (direction.loc[i-1] + direction.loc[i+1]) / 2

top['Wind Direction'] = direction
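
For comparison, here is a loop-free sketch of the same idea on a hypothetical series: set the `VRB` and `000` labels aside, convert the rest with `pd.to_numeric`, fill gaps from the two neighbors, then put the labels back. The names `is_special`, `numeric`, and `cleaned` are made up for this example, and the numeric values come out as floats rather than `int`:

```python
import numpy as np
import pandas as pd

# Hypothetical direction readings: numeric strings, 'VRB', '000', and NaN
direction = pd.Series(["350", "VRB", "000", np.nan, "010", np.nan, "020"],
                      dtype=object)

# Flag the non-numeric labels so they can be restored later
is_special = direction.isin(["VRB", "000"])

# Numeric view: labels and missing values both become NaN here
numeric = pd.to_numeric(direction.where(~is_special), errors="coerce")

# Fill a gap only when both neighbors are numeric readings
neighbor_mean = (numeric.shift(1) + numeric.shift(-1)) / 2
numeric_filled = numeric.fillna(neighbor_mean)

# Restore the 'VRB' / '000' labels on top of the numeric result
cleaned = numeric_filled.astype(object)
cleaned[is_special] = direction[is_special]
print(cleaned)
```

The gap next to `000` stays NaN because one of its neighbors is not a numeric reading, which matches the integer check in the loop.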


In [7]:
top[['UTC','Wind Speed','Wind Direction']].isna().sum()

UTC                0
Wind Speed        11
Wind Direction    45
dtype: int64

Pretty good

In [8]:
top[['UTC','Wind Speed','Wind Direction']].shape

(43787, 3)

In [9]:
#expected number of hourly rows: 5 years of 365 days plus one leap day
24*365*5+24

43824

Still missing those 37 hours, but that's ok
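
If it ever becomes worth tracking down the missing hours, one way is to compare the observed timestamps against a complete hourly range. This is a minimal sketch on hypothetical timestamps, using `pd.date_range` and `Index.difference`:

```python
import pandas as pd

# Hypothetical top-of-the-hour timestamps with one hour missing
observed = pd.to_datetime(
    ["2015-01-01 08:00", "2015-01-01 09:00",
     "2015-01-01 11:00", "2015-01-01 12:00"], utc=True)

# Every hour between the first and last observation
expected = pd.date_range(observed.min(), observed.max(),
                         freq=pd.Timedelta(hours=1))

# Hours that should exist but were never observed
missing = expected.difference(observed)
print(missing)
```

With the real data, `observed` would presumably be the `UTC` column of `top`.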

In [10]:
top[['UTC','Wind Speed','Wind Direction']].to_csv(root_folder/'data/interim/00-wind.csv', index=False)