### First go at looking at complete data

In [1]:
import pandas as pd
import numpy as np
from pathlib import Path

In [136]:
root_folder = Path.cwd().parents[1]

df = pd.read_csv(root_folder/'data/processed/00-final.csv')

df['UTC'] = pd.to_datetime(df['PST'], utc=True)

In [138]:
df['UTC'].dtype

datetime64[ns, UTC]

In [140]:
df.head()

| | PST | Tide | Height | Deg | Period | Wind Speed | Wind Direction | UTC |
|---|---|---|---|---|---|---|---|---|
| 0 | 2017-01-01 07:00:00-08:00 | 3.82 | 2.47 | 306.0 | 11.0 | 7.0 | 260 | 2017-01-01 15:00:00+00:00 |
| 1 | 2017-01-01 08:00:00-08:00 | 4.42 | 2.37 | 309.0 | 9.0 | 6.0 | 40 | 2017-01-01 16:00:00+00:00 |
| 2 | 2017-01-01 09:00:00-08:00 | 5.13 | 2.38 | 310.0 | 11.0 | 0.0 | 0 | 2017-01-01 17:00:00+00:00 |
| 3 | 2017-01-01 10:00:00-08:00 | 5.66 | 2.66 | 314.0 | 11.0 | 0.0 | 0 | 2017-01-01 18:00:00+00:00 |
| 4 | 2017-01-01 11:00:00-08:00 | 5.71 | 2.38 | 303.0 | 11.0 | 5.0 | 150 | 2017-01-01 19:00:00+00:00 |


The `pd.to_datetime()` function is not working how I thought it would. With `utc=False` the column comes back as `object` instead of a datetime dtype:

In [141]:
df['PST'] = pd.to_datetime(df['PST'], utc=False)
df['PST'].dtype

dtype('O')

Don't feel like messing with it, so I just created a UTC column. Keeping the PST column in case I need it later, even though it seems useless at the moment.
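The `object` dtype most likely comes from mixed UTC offsets in the strings (-08:00 in winter and presumably -07:00 during daylight saving time), which pandas can't hold in one fixed-offset datetime column. A minimal sketch of one way around it, assuming the timestamps are local California time: parse with `utc=True`, then convert to the `America/Los_Angeles` zone. The sample strings and the `Local` column name are made up for illustration.

```python
import pandas as pd

# Made-up sample strings: one winter (-08:00) and one summer (-07:00) timestamp
sample = pd.DataFrame({'PST': ['2017-01-01 07:00:00-08:00',
                               '2017-07-01 07:00:00-07:00']})

# utc=True normalizes every offset to UTC, giving a proper datetime dtype
sample['UTC'] = pd.to_datetime(sample['PST'], utc=True)

# tz_convert shifts to a named zone that knows about daylight saving time
sample['Local'] = sample['UTC'].dt.tz_convert('America/Los_Angeles')

print(sample['Local'].dtype)
print(sample['Local'].dt.hour.tolist())
```

With a tz-aware local column, the `.dt` accessors (hour, month, and so on) reflect local clock time instead of UTC.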

![](devs_surfline.png)

![](compass.jpeg)

### **Swell Direction**

In [18]:
slice = 360/32  #degrees per point on a 32-point compass (note: this name shadows the built-in slice)
print(23*slice)
print(27*slice)

258.75
303.75


WSW = 258 (258.75 rounded down)<br/>
WNW = 304 (303.75 rounded up)<br/>
Any swell direction in the range $[258,304]$ will be considered ideal <br/>
Rounding outward to make the window slightly bigger
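A quick sketch of how that window could be turned into a 1/0 flag. The degrees below are made up, and `Ideal Swell Dir` is a hypothetical column name; the real directions live in `df['Deg']`.

```python
import pandas as pd

# Made-up swell directions in degrees; the real values live in df['Deg']
demo = pd.DataFrame({'Deg': [250.0, 258.0, 285.0, 304.0, 312.0]})

# Series.between is inclusive on both ends by default, matching [258, 304]
demo['Ideal Swell Dir'] = demo['Deg'].between(258, 304).astype(int)

print(demo)
```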

### **Period**
I remember 15 second swell being really rare, so let's see how often it actually happened

In [32]:
df[df['Period']>=15].shape

(2337, 7)

ok maybe more often than I thought....

In [29]:
df.shape

(23850, 7)
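Putting the two shapes side by side: 2337 of 23850 rows is about 9.8% of the hourly records. A small sketch of getting that share directly, using a made-up `Period` column; the mean of a boolean mask is the fraction of `True` values.

```python
import pandas as pd

# Made-up periods in seconds; the real values live in df['Period']
demo = pd.DataFrame({'Period': [11.0, 9.0, 15.0, 17.0, 4.0]})

long_period = demo['Period'] >= 15   # boolean mask
share = long_period.mean()           # fraction of rows with period >= 15 s

print(f"{long_period.sum()} of {len(demo)} rows ({share:.1%})")
```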

In [31]:
df[df['Period']>=15]

| | PST | Tide | Height | Deg | Period | Wind Speed | Wind Direction |
|---|---|---|---|---|---|---|---|
| 21 | 2017-01-02 17:00:00-08:00 | 0.71 | 2.76 | 319.0 | 15.0 | 7.0 | 310 |
| 165 | 2017-01-16 07:00:00-08:00 | 2.46 | 2.31 | 285.0 | 17.0 | 0.0 | 000 |
| 166 | 2017-01-16 08:00:00-08:00 | 2.94 | 2.10 | 282.0 | 17.0 | 0.0 | 000 |
| 167 | 2017-01-16 09:00:00-08:00 | 3.69 | 2.48 | 283.0 | 17.0 | 0.0 | 000 |
| 168 | 2017-01-16 10:00:00-08:00 | 4.46 | 2.57 | 286.0 | 17.0 | 0.0 | 000 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 23728 | 2021-12-20 15:00:00-08:00 | 0.04 | 1.15 | 305.0 | 15.0 | 7.0 | 260 |
| 23729 | 2021-12-20 16:00:00-08:00 | -0.45 | 1.12 | 293.0 | 15.0 | 6.0 | 250 |
| 23730 | 2021-12-20 17:00:00-08:00 | 0.07 | 1.20 | 289.0 | 15.0 | 0.0 | 000 |
| 23789 | 2021-12-26 10:00:00-08:00 | 2.88 | 2.83 | 312.0 | 15.0 | 9.0 | 270 |


I kinda want to go against Surfline on this one and say that a period of 15 sec and up is fine; it just means the waves were big and probably epic <br/>
Should get some confirmation on that though <br/>
Also, because the data is from the Harvest buoy, it might be more extreme than the swell that actually arrives on shore, since Harvest is all the way out near Point Conception. Could also do the analysis with different buoy data: have three datasets (Harvest, East SB, West SB) and run them all through a similar process

### **Wind Direction**

In [19]:
slice*5

56.25

N=0 <br/>
NE=56.25<br/>
Any wind $>$ 5mph and $<$ 10mph in range $[0,57]$ will be considered ideal <br/>
Any wind $\leq$ 5mph in any direction will be considered ideal <br/>
Any wind $\geq$ 10mph in any direction will not be considered ideal <br/>
Anything else (wind $>$ 5mph from outside that range) will not be considered ideal <br/>

### **Tide**

In [25]:
print("Lowest Low:", df['Tide'].min())
print("Highest High:", df['Tide'].max())

Lowest Low: -1.78
Highest High: 7.61


From my experience, mid tide (the good tide) at Devs is roughly 2ft-5ft. Should corroborate that, though
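If the 2ft-5ft guess holds up, flagging it could look like the sketch below. The tide values are made up (the first and last are the min and max printed above), and `Ideal Tide` is a hypothetical column name.

```python
import pandas as pd

# Made-up tide heights in feet; the real values live in df['Tide']
demo = pd.DataFrame({'Tide': [-1.78, 2.0, 3.82, 5.0, 7.61]})

# Inclusive on both ends, so exactly 2 ft and exactly 5 ft count as mid tide
demo['Ideal Tide'] = demo['Tide'].between(2, 5).astype(int)

print(demo)
```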

### **Surf Height**
This one is a bit more of a doozy because the swell height recorded at the buoy is not the same as the actual wave height. Might have to stray away from Surfline for this one and do some more research/asking around. Can ask Haley, Chris Keet, Kenna, Nathan <br/>
Will mess around with analysis methods for now without including this parameter

### Wind is probably the easiest to filter through, so I'll start there
Gonna make a column called `Ideal Wind` that holds a 1 if the wind is ideal and a 0 if not

In [142]:
#start with every row flagged as not ideal
df['Ideal Wind'] = 0
df.head()

| | PST | Tide | Height | Deg | Period | Wind Speed | Wind Direction | UTC | Ideal Wind |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2017-01-01 07:00:00-08:00 | 3.82 | 2.47 | 306.0 | 11.0 | 7.0 | 260 | 2017-01-01 15:00:00+00:00 | 0 |
| 1 | 2017-01-01 08:00:00-08:00 | 4.42 | 2.37 | 309.0 | 9.0 | 6.0 | 40 | 2017-01-01 16:00:00+00:00 | 0 |
| 2 | 2017-01-01 09:00:00-08:00 | 5.13 | 2.38 | 310.0 | 11.0 | 0.0 | 0 | 2017-01-01 17:00:00+00:00 | 0 |
| 3 | 2017-01-01 10:00:00-08:00 | 5.66 | 2.66 | 314.0 | 11.0 | 0.0 | 0 | 2017-01-01 18:00:00+00:00 | 0 |
| 4 | 2017-01-01 11:00:00-08:00 | 5.71 | 2.38 | 303.0 | 11.0 | 5.0 | 150 | 2017-01-01 19:00:00+00:00 | 0 |


In [36]:
df['Wind Direction'].value_counts()

000      3627
260      2038
250      1966
150      1467
240      1420
         ... 
195.0       1
10.0        1
250.0       1
315.0       1
100.0       1
Name: Wind Direction, Length: 67, dtype: int64

In [38]:
type(df['Wind Direction'].iloc[2])

str

Since ```'000'``` is given for calm winds, those automatically qualify as ideal <br/>
```'VRB'``` means variable wind direction, so a ```'VRB'``` reading with wind speed over 5mph is not ideal

Think I should change the ```'000'``` to ```'calm'``` so it doesn't get read as a direction of 0 (due north)

In [143]:
#changing '000' to 'calm'
df.loc[df['Wind Direction']== '000', 'Wind Direction'] = 'calm'

#'calm' winds are ideal
df.loc[df['Wind Direction']== 'calm', 'Ideal Wind'] = 1

#winds of 5mph or less are ideal
df.loc[df['Wind Speed'] <= 5, 'Ideal Wind'] = 1

#winds with direction 0-57 are ideal
wdints = pd.to_numeric(df['Wind Direction'],errors='coerce') #non-numeric values ('calm', 'VRB') become NaN
goodwind = (0 <= wdints) & (wdints <= 57) #boolean series, False wherever wdints is NaN
df.loc[goodwind, 'Ideal Wind'] = 1

#winds of 10mph or more are not ideal (runs last so it overrides the rules above)
df.loc[df['Wind Speed']>=10, 'Ideal Wind'] = 0
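The same rules can also be written as one boolean expression, which makes the override order explicit. This is only a sketch: `ideal_wind` is a hypothetical helper, and the demo rows are the speed/direction pairs from the first ten rows of the dataset.

```python
import pandas as pd

def ideal_wind(speed, direction):
    """Return 1/0 flags following the wind rules in the notes above."""
    deg = pd.to_numeric(direction, errors='coerce')  # 'calm'/'VRB' become NaN
    light = speed <= 5                               # any direction is fine
    good_dir = deg.between(0, 57)                    # NaN compares as False
    calm = direction == 'calm'
    # 10 mph or more is never ideal, whatever the direction
    return ((light | good_dir | calm) & (speed < 10)).astype(int)

demo = pd.DataFrame({
    'Wind Speed': [7.0, 6.0, 0.0, 0.0, 5.0, 6.0, 8.0, 8.0, 3.0, 5.0],
    'Wind Direction': ['260', '40', 'calm', 'calm', '150',
                       '260', '230', '110', 'VRB', '300'],
})
demo['Ideal Wind'] = ideal_wind(demo['Wind Speed'], demo['Wind Direction'])
print(demo)
```

Keeping the rules in one function would make it easier to rerun them on other buoy datasets later.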

In [144]:
df.head(10)

| | PST | Tide | Height | Deg | Period | Wind Speed | Wind Direction | UTC | Ideal Wind |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2017-01-01 07:00:00-08:00 | 3.82 | 2.47 | 306.0 | 11.0 | 7.0 | 260 | 2017-01-01 15:00:00+00:00 | 0 |
| 1 | 2017-01-01 08:00:00-08:00 | 4.42 | 2.37 | 309.0 | 9.0 | 6.0 | 40 | 2017-01-01 16:00:00+00:00 | 1 |
| 2 | 2017-01-01 09:00:00-08:00 | 5.13 | 2.38 | 310.0 | 11.0 | 0.0 | calm | 2017-01-01 17:00:00+00:00 | 1 |
| 3 | 2017-01-01 10:00:00-08:00 | 5.66 | 2.66 | 314.0 | 11.0 | 0.0 | calm | 2017-01-01 18:00:00+00:00 | 1 |
| 4 | 2017-01-01 11:00:00-08:00 | 5.71 | 2.38 | 303.0 | 11.0 | 5.0 | 150 | 2017-01-01 19:00:00+00:00 | 1 |
| 5 | 2017-01-01 12:00:00-08:00 | 4.86 | 2.54 | 306.0 | 4.0 | 6.0 | 260 | 2017-01-01 20:00:00+00:00 | 0 |
| 6 | 2017-01-01 13:00:00-08:00 | 4.3 | 2.41 | 300.0 | 4.0 | 8.0 | 230 | 2017-01-01 21:00:00+00:00 | 0 |
| 7 | 2017-01-01 14:00:00-08:00 | 3.33 | 2.62 | 312.0 | 4.0 | 8.0 | 110 | 2017-01-01 22:00:00+00:00 | 0 |
| 8 | 2017-01-01 15:00:00-08:00 | 1.85 | 2.59 | 313.0 | 4.0 | 3.0 | VRB | 2017-01-01 23:00:00+00:00 | 1 |
| 9 | 2017-01-01 16:00:00-08:00 | 0.82 | 2.72 | 316.0 | 11.0 | 5.0 | 300 | 2017-01-02 00:00:00+00:00 | 1 |


In [151]:
df.groupby(df['UTC'].dt.year)[['Ideal Wind']].sum()

| UTC | Ideal Wind |
|---|---|
| 2017 | 1697 |
| 2018 | 1701 |
| 2019 | 1644 |
| 2020 | 1777 |
| 2021 | 1693 |


Gotta figure out how to do groupby year/month combo, hopefully there's a way besides splitting up the dataset by year and then doing all of this on 5 datasets
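One way that should work without splitting anything: `groupby` accepts a list of keys, so the year and month pulled from `UTC` can be passed together. A minimal sketch on made-up rows; the `year` and `month` names are just labels chosen here.

```python
import pandas as pd

# Made-up flags spanning two months in one year and one month in the next
demo = pd.DataFrame({
    'UTC': pd.to_datetime(['2017-01-01 15:00', '2017-01-02 15:00',
                           '2017-02-01 15:00', '2018-01-01 15:00'], utc=True),
    'Ideal Wind': [1, 0, 1, 1],
})

# Passing a list of keys groups by every (year, month) pair in one pass
year = demo['UTC'].dt.year.rename('year')
month = demo['UTC'].dt.month.rename('month')
monthly = demo.groupby([year, month])['Ideal Wind'].sum()

print(monthly)
```

The result has a two-level (year, month) index; `monthly.unstack('month')` would reshape it into a year-by-month table.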