In [3]:
from PIL import Image
import datetime as dt

## *✭˚･ﾟ✧*･ﾟ 𝓬𝓪𝓷𝓪𝓭𝓲𝓪𝓷 𝓼𝓱𝓲𝓮𝓵𝓭 𝓭𝓲𝓼𝓬𝓱𝓪𝓻𝓰𝓮 𝓹𝓻𝓮𝓭𝓲𝓬𝓽𝓲𝓸𝓷 *✭˚･ﾟ✧*･ﾟ*

Taking notes as I go along...

---

I want to use a GCP VM for this project, just for practice. Set up one:

In [None]:
im = Image.open('misc_media/gcp_vm_0.png')
im

I assume I need a GPU for the ANN. Not sure exactly because I've never actually used one, but why not? However, there's the obvious price associated with this... looks like it's about 0.70 per hour... Not the cheapest, but not the most expensive either (lol throwback to when I tried to use the ultra high memory VM and it cost 150 in a day. lesson learned, never again...)

I set it up with Ubuntu OS and 10 GB persistent disk

Then followed my tutorial for setting up an Ubuntu desktop... not really necessary (suppose everything could be done from command line) but maybe kind of convenient?

Just a reminder, to ssh into active VM instance via cmd type:

$ gcloud compute ssh hydro-ann --zone us-central1-a --ssh-flag "-L 5901:localhost:5901" 

And then you can access the desktop via VNC at localhost:5901

Great okay now that that's done... what is this actual project going to be?

---

So, let's establish what data I'm using for this project:

 - ECCC discharge data
 (https://wateroffice.ec.gc.ca/mainmenu/real_time_data_index_e.html)
 <br>
 This is real-time and historic discharge data for all the ECCC stations in Canada.


- ECCC weather station data (https://climate.weather.gc.ca/historical_data/search_historic_data_e.html)
<br>
This is climate data for ECCC weather stations.

So obviously I cannot do this for all of the stations in Canada... probably makes sense to just choose a few stations to train and test on. And then can possibly expand to other stations to see if model is broadly applicable?

Let's say we have stations with nearby weather stations. We could do two different model trainings:

1. train using discharge stations and close by weather stations
2. train using discharge stations and ERA5 reanalysis

Then we could test to see how each performed. Does the ERA5 reanalysis do okay in comparison to the close-by weather stations? If so, maybe this gives reason to believe results for stations without close-by weather stations. Don't know.
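
One way to make "how each performed" concrete is the Nash-Sutcliffe efficiency (NSE), a score commonly used for simulated discharge. This is only a sketch, nothing is decided for this project yet; the function name and the sample numbers are made up for illustration.

```python
import numpy as np

def nash_sutcliffe(observed, simulated):
    """NSE = 1 - sum((obs - sim)^2) / sum((obs - mean(obs))^2).

    1.0 is a perfect fit; 0.0 means no better than predicting the mean.
    """
    obs = np.asarray(observed, dtype=float)
    sim = np.asarray(simulated, dtype=float)
    # Ignore days where either series is missing
    ok = ~(np.isnan(obs) | np.isnan(sim))
    obs, sim = obs[ok], sim[ok]
    return 1.0 - np.sum((obs - sim) ** 2) / np.sum((obs - obs.mean()) ** 2)

# Made-up discharge values (m3/s), only to show the call
obs = [10.0, 12.0, 15.0, 11.0]
print(nash_sutcliffe(obs, obs))                 # score of a perfect prediction
print(nash_sutcliffe(obs, [np.mean(obs)] * 4))  # score of a mean-only prediction
```

The same function could be called once for the weather-station model and once for the ERA5 model on the same test dates.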

Regardless, I think I like the idea of choosing stations only with nearby weather stations for training. That way I can turn this variable on/off.

So, what stations?

- INDIN RIVER ABOVE CHALCO LAKE - (07SA004) and INDIN RIVER - (10757) - (data until 2004); about 0.25 km from each other; DATA EXISTS
- BAKER CREEK AT OUTLET OF LOWER MARTIN LAKE - (07SB013) and YELLOWKNIFE-HENDERSON - (45467) - (data until 2020); about 7 km from each other; DATA EXISTS
- HAY RIVER NEAR HAY RIVER - (07OB001) and HAY RIVER A - (1664) - (data until 2014); about 11 km from each other; DATA EXISTS
- HANBURY RIVER ABOVE HOARE LAKE - (06JB001) and HANBURY RIVER - (10897) - (data until 2020); about 1.6 km away from each other; DATA EXISTS
- SNARE RIVER BELOW GHOST RIVER - (07SA002) and INDIN RIVER - (10757) - (data until 2004); about 50 km away from each other; DATA EXISTS
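
To double-check distances like these, the haversine great-circle distance between two sets of coordinates is good enough. A minimal sketch; the coordinates below are placeholders and NOT the real station locations.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

# Placeholder coordinates (NOT the real station locations)
gauge = (62.50, -114.40)
weather_station = (62.46, -114.44)
print(round(haversine_km(*gauge, *weather_station), 1), 'km')
```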

I'm bored...

anyways

now I have 5 stations with nearby weather stations to use and varying time periods

---

download the data:

- 07SA004 ... done
- 07SB013 ... done
- 07OB001 ... done
- 06JB001 ... done
- 07SA002 ... done

- 10757 ... done 1997-2002
- 45467 ... done 2013-2018
- 1664 ... done 2009-2014
- 10897 ... done 2013-2018

** note: i might need to download more years for that weather station data if it doesn't overlap well with the discharge data
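
A quick way to check the overlap is to compare the first and last dates of each series. A sketch with toy date ranges standing in for the real files; `overlap` is a hypothetical helper.

```python
import pandas as pd

def overlap(index_a, index_b):
    """Return (start, end) of the shared date range, or None if there is none."""
    start = max(index_a.min(), index_b.min())
    end = min(index_a.max(), index_b.max())
    return (start, end) if start <= end else None

# Toy ranges, not the real station records
discharge_dates = pd.date_range('2009-01-01', '2013-12-31', freq='D')
weather_dates = pd.date_range('2011-06-01', '2014-12-31', freq='D')
print(overlap(discharge_dates, weather_dates))
```

This only looks at the end points, so gaps inside either record would still need a separate check.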

---

### Cleaning up the data and preliminary visualizations:

- Daily Discharge (m3/s) (PARAM = 1) and Daily Water Level (m) (PARAM = 2)

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
indin = pd.read_csv(r'data/stream_flow/07SA004_daily.csv', sep='\t')
baker = pd.read_csv(r'data/stream_flow/07SB013_daily.csv', sep= '\t')
hay = pd.read_csv(r'data/stream_flow/07OB001_daily.csv', sep= '\t')
hanbury = pd.read_csv(r'data/stream_flow/06JB001_daily.csv', sep= '\t')
snare = pd.read_csv(r'data/stream_flow/07SA002_daily.csv', sep= '\t')

In [None]:
indin_flow = indin.loc[indin['PARAM'] == 1]
indin_lev = indin.loc[indin['PARAM'] == 2]

baker_flow = baker.loc[baker['PARAM'] == 1]
baker_lev = baker.loc[baker['PARAM'] == 2]

hay_flow = hay.loc[hay['PARAM'] == 1]
hay_lev = hay.loc[hay['PARAM'] == 2]

hanbury_flow = hanbury.loc[hanbury['PARAM'] == 1]
hanbury_lev = hanbury.loc[hanbury['PARAM'] == 2]

snare_flow = snare.loc[snare['PARAM'] == 1]
snare_lev = snare.loc[snare['PARAM'] == 2]

In [None]:
# Work on a copy so that adding columns below does not touch the original frame
flow = hay_flow.copy()

In [None]:
flow.columns = ['station', 'param', 'date', 'flow', 'sym']

flow['datetime'] = pd.to_datetime(flow['date'])

#hanbury_holdout = hanbury_flow[(hanbury_flow['datetime'] > '12-31-2017') & (hanbury_flow['datetime'] < '01-01-2019')]
#hanbury_holdout = hanbury_holdout.set_index('datetime')
#level = hanbury_lev[(hanbury_lev['datetime'] > '12-31-2017') & (hanbury_lev['datetime'] < '01-01-2019')]
#level = level.set_index('datetime')
#hanbury_holdout['level'] = level['level']

#snare_holdout['station'] = '07SA004'

#hanbury_holdout = hanbury_holdout.drop(['param', 'sym'], axis=1)

In [None]:
#hanbury_holdout.to_csv('hanbury_holdout.csv')

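The test and validation sets together add up to one full year out of each 5-year dataset.
I split the year into four "seasons" (JFM, AMJ, JAS, OND) and each season into two halves, which gives eight half-seasons. Four of the five years are picked at random, and from each picked year one random half-season goes to testing and another to validation, with no half-season used twice.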
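
A compact way to make this kind of draw is NumPy's `Generator.choice(..., replace=False)` together with a shuffled list of half-seasons. A sketch, assuming one test and one validation half-season per chosen year; the seed is arbitrary and `draw_split` is a hypothetical helper.

```python
import numpy as np

def draw_split(years, n_years=4, n_half_seasons=8, seed=0):
    """Pick n_years distinct years and pair each with two distinct half-seasons.

    Returns (chosen_years, year_seas), where year_seas[i] is
    [test half-season, validation half-season] for chosen_years[i].
    """
    rng = np.random.default_rng(seed)
    chosen_years = rng.choice(years, size=n_years, replace=False)
    # Shuffle all half-seasons, then deal them out two per year
    shuffled = rng.permutation(n_half_seasons)
    year_seas = shuffled[: 2 * n_years].reshape(n_years, 2)
    return chosen_years.tolist(), year_seas.tolist()

chosen_years, year_seas = draw_split([2009, 2010, 2011, 2012, 2013])
print(chosen_years)
print(year_seas)
```

Fixing the seed makes the split reproducible, which matters if the CSVs ever have to be regenerated.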

In [None]:
flow = flow[(flow['datetime'] > '12-31-2008') & (flow['datetime'] < '01-01-2014')]
flow = flow.set_index('datetime')

In [None]:
years = [2009, 2010, 2011, 2012, 2013]
year_seas = []
chosen_years = []
half_seasons_str = [['01-01', '02-16'], ['02-16','04-01'], ['04-01', '05-16'], ['05-16', '07-01'], ['07-01', '08-16'], ['08-16', '10-01'], ['10-01', '11-16'], ['11-16', '12-31']]

In [None]:
half_seasons = np.arange(0,8,1)
for i in range(0,4):
    year = np.random.choice(years)
    years = np.delete(years, np.where(years == year))
    chosen_years.append(year)
    
    seas_0 = np.random.choice(half_seasons)
    seases = []
    seases.append(seas_0)
    half_seasons = np.delete(half_seasons, np.where(half_seasons == seas_0))
    seas_1 = np.random.choice(half_seasons)
    half_seasons = np.delete(half_seasons, np.where(half_seasons == seas_1))
    seases.append(seas_1)
    year_seas.append(seases)

In [None]:
print(chosen_years)
print(year_seas)

In [None]:
# Build the test and validation sets: one half-season of each per chosen year.
# year_seas[i] = [test half-season, validation half-season] for chosen_years[i].
test_parts = []
valid_parts = []
for i, year in enumerate(chosen_years):
    start = half_seasons_str[year_seas[i][0]][0] + '-' + str(year)
    end = half_seasons_str[year_seas[i][0]][1] + '-' + str(year)
    test_parts.append(flow[(flow.index >= start) & (flow.index < end)])
    start = half_seasons_str[year_seas[i][1]][0] + '-' + str(year)
    end = half_seasons_str[year_seas[i][1]][1] + '-' + str(year)
    valid_parts.append(flow[(flow.index >= start) & (flow.index < end)])

# Stack the pieces into single frames
test = pd.concat(test_parts)
valid = pd.concat(valid_parts)

In [None]:
test = test.drop(['param', 'sym'], axis=1)
test = test.sort_values(by='datetime')
test.to_csv('hay_test.csv')

valid = valid.drop(['param', 'sym'], axis=1)
valid = valid.sort_values(by='datetime')
valid.to_csv('hay_valid.csv')

In [None]:
test_dates = pd.read_csv('hay_test.csv').set_index('datetime')
valid_dates = pd.read_csv('hay_valid.csv').set_index('datetime')

In [None]:
train = flow

In [None]:
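# Drop every date that was set aside for testing or validation
held_out = np.concatenate([test_dates.date.values, valid_dates.date.values])
train = train[~np.isin(train.date.values, held_out)]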

In [None]:
train = train.drop(['param', 'sym'], axis=1)
train = train.sort_values(by='datetime')
train.to_csv('hay_train.csv')
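
Before moving on it is worth checking that no date appears in more than one split. A sketch with toy frames; `assert_disjoint` is a hypothetical helper, and it assumes each split has a `date` column like the CSVs written above.

```python
import pandas as pd

def assert_disjoint(**splits):
    """Raise AssertionError if any date is shared between two splits."""
    names = list(splits)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            shared = set(splits[a]['date']) & set(splits[b]['date'])
            assert not shared, f'{a} and {b} share {len(shared)} dates'

# Toy frames standing in for hay_train.csv, hay_valid.csv and hay_test.csv
train_toy = pd.DataFrame({'date': ['2009/01/01', '2009/01/02']})
valid_toy = pd.DataFrame({'date': ['2009/01/03']})
test_toy = pd.DataFrame({'date': ['2009/01/04']})
assert_disjoint(train=train_toy, valid=valid_toy, test=test_toy)
```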

---

ugh okay now I need the associated weather station data:

In [None]:
test = pd.read_csv('hanbury_test.csv')
train = pd.read_csv('hanbury_train.csv')
valid = pd.read_csv('hanbury_valid.csv')
holdout = pd.read_csv('hanbury_holdout.csv')

In [None]:
weather_0 = pd.read_csv(r'data/weather/10897/10897_2013_daily.csv', sep=',')
weather_1 = pd.read_csv(r'data/weather/10897/10897_2014_daily.csv', sep=',')
weather_2 = pd.read_csv(r'data/weather/10897/10897_2015_daily.csv', sep=',')
weather_3 = pd.read_csv(r'data/weather/10897/10897_2016_daily.csv', sep=',')
weather_4 = pd.read_csv(r'data/weather/10897/10897_2017_daily.csv', sep=',')
holdout_weather = pd.read_csv(r'data/weather/10897/10897_2018_daily.csv', sep=',')

In [None]:
weather_0['datetime'] = pd.to_datetime(weather_0['Date/Time'], format = '%Y-%m-%d')
weather_0 = weather_0.set_index('datetime')
weather_1['datetime'] = pd.to_datetime(weather_1['Date/Time'], format = '%Y-%m-%d')
weather_1 = weather_1.set_index('datetime')
weather_2['datetime'] = pd.to_datetime(weather_2['Date/Time'], format = '%Y-%m-%d')
weather_2 = weather_2.set_index('datetime')
weather_3['datetime'] = pd.to_datetime(weather_3['Date/Time'], format = '%Y-%m-%d')
weather_3 = weather_3.set_index('datetime')
weather_4['datetime'] = pd.to_datetime(weather_4['Date/Time'], format = '%Y-%m-%d')
weather_4 = weather_4.set_index('datetime')

holdout_weather['datetime'] = pd.to_datetime(holdout_weather['Date/Time'], format = '%Y-%m-%d')
holdout_weather = holdout_weather.set_index('datetime')

In [None]:
weather = pd.concat([weather_0, weather_1, weather_2, weather_3, weather_4])
#weather = holdout_weather
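
The per-year files can also be read in a loop instead of one variable per year. A sketch, assuming the `data/weather/<station>/<station>_<year>_daily.csv` layout used above; `load_weather` is a hypothetical helper.

```python
from pathlib import Path
import pandas as pd

def load_weather(station_id, years, root='data/weather'):
    """Read one daily CSV per year and stack them into a date-indexed frame."""
    frames = []
    for year in years:
        path = Path(root) / str(station_id) / f'{station_id}_{year}_daily.csv'
        frame = pd.read_csv(path)
        frame['datetime'] = pd.to_datetime(frame['Date/Time'], format='%Y-%m-%d')
        frames.append(frame.set_index('datetime'))
    return pd.concat(frames).sort_index()

# Usage, assuming the files exist on disk:
# weather = load_weather(10897, range(2013, 2018))
```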

In [None]:
df = valid

In [None]:
df['datetime'] = pd.to_datetime(df['datetime'], format = '%Y-%m-%d') 
df = df.set_index('datetime')

In [None]:
df['st_max_temp'] = weather['Max Temp (°C)']
df['st_min_temp'] = weather['Min Temp (°C)']
df['st_mean_temp'] = weather['Mean Temp (°C)']
df['st_heat_deg_days'] = weather['Heat Deg Days (°C)']
df['st_cool_deg_days'] = weather['Cool Deg Days (°C)']
df['st_total_rain'] = weather['Total Rain (mm)']
df['st_total_snow'] = weather['Total Snow (cm)']
df['st_total_precip'] = weather['Total Precip (mm)']
df['st_total_snow_on_ground'] = weather['Snow on Grnd (cm)']
df['st_dir_of_max_gust_10sdeg'] = weather['Dir of Max Gust (10s deg)']
df['st_spd_of_max_gust_kmh'] = weather['Spd of Max Gust (km/h)']

In [None]:
df.to_csv('hanbury_valid.csv')
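
The column-by-column assignment works because pandas aligns on the shared datetime index. The same thing can be written as a rename plus a left join. A sketch with toy data; `WEATHER_COLUMNS` and `add_weather` are hypothetical names, and the mapping only lists two of the columns.

```python
import pandas as pd

# Source column -> short name (hypothetical subset, extend as needed)
WEATHER_COLUMNS = {
    'Mean Temp (°C)': 'st_mean_temp',
    'Total Precip (mm)': 'st_total_precip',
}

def add_weather(df, weather, columns=WEATHER_COLUMNS):
    """Left-join the selected weather columns onto df by datetime index."""
    picked = weather[list(columns)].rename(columns=columns)
    return df.join(picked, how='left')

# Toy data: the weather frame covers more days than the flow frame
days = pd.date_range('2013-01-01', periods=3, freq='D')
flow_toy = pd.DataFrame({'flow': [1.0, 2.0]}, index=days[:2])
weather_toy = pd.DataFrame(
    {'Mean Temp (°C)': [-20.0, -18.0, -15.0], 'Total Precip (mm)': [0.0, 1.2, 0.0]},
    index=days,
)
print(add_weather(flow_toy, weather_toy))
```

Keeping the mapping in one place makes a copy-paste slip in a single column name easier to spot.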

---

alright now I need to add the ERA5 reanalysis data:

In [1]:
import xarray as xr

ModuleNotFoundError: No module named 'xarray'

coords of weather stations: