![Callysto.ca Banner](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-top.jpg?raw=true)

# River and Lake Water Levels

Have you ever wondered what a lake's water levels can reveal? Investigating a lake's water levels can provide valuable insight into its hydrological behaviour. By analyzing the fluctuations and patterns in water levels, we can gain a better sense of lake dynamics and make informed decisions in regard to the surrounding ecosystem.

*Curriculum Connections*

- [Freshwater and Saltwater Systems](https://education.alberta.ca/media/3069389/pos_science_7_9.pdf) 
- [Investigate Forces of Water, air including force lift and drag](https://www.alberta.ca/curriculum-science.aspx)

*Investigating Questions*

- What are the historical trends in water levels for specific water bodies in Canada, and how have they changed over time?
- What factors influence the water levels in Canada's rivers, lakes, and reservoirs?
- How do natural factors like evaporation and groundwater recharge influence water levels in Canada?

***
[Shuswap Lake](https://bcparks.ca/shuswap-lake-park/) is a beautiful and popular freshwater lake located in the Okanagan Region of British Columbia, and it will be the primary body of water that we explore throughout this notebook. The lake's water level fluctuates over the year due to rainfall and snowmelt runoff from the mountains.

[![Shuswap Lake](https://img.youtube.com/vi/1fJlFh4eJ08/0.jpg)](https://www.youtube.com/watch?v=1fJlFh4eJ08)

### Import the Data
The code below will import the Python programming libraries we need to gather and organize the data to answer our questions. `▶Run` the code cell below.

In [None]:
## import libraries
import pandas as pd
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go
from sklearn.model_selection import train_test_split
import numpy as np
import matplotlib.pyplot as plt
import folium
from folium.plugins import MarkerCluster
print('Libraries imported.')

### Introductory Analysis

Let's take a look at the most recent water level data for Shuswap Lake. Datasets were obtained from the [Government of Canada](https://dd.weather.gc.ca/hydrometric/).

In [None]:
station = "08LE070"
shuswap_data= pd.read_csv(f'https://dd.weather.gc.ca/hydrometric/csv/BC/daily/BC_{station}_daily_hydrometric.csv')
shuswap_data

Our dataset has _10_ different columns. Of these, `Water Level / Niveau d'eau (m)` and `Date` appear to be the most important, since together they let us investigate whether water levels rise or fall over certain dates.

For now, let's **plot** these important columns to see if there are any trends.

In [None]:
px.line(shuswap_data, x="Date", y="Water Level / Niveau d'eau (m)",title="Shuswap Lake Levels")

### Questions:

1. Is there an ascending (going upwards) trend, a descending (going downwards) trend, or no trend in the visualization?
2. Can you think of reasons why particular months would have ascending or descending trends?
3. Does the rate at which the water level changes from month to month surprise you? Do you think the change should be slower or faster?

### Exploring Extremities

Now that we have looked at trends over the past few months, let's delve into datasets that explore other aspects of water levels, in particular specialized historical datasets from the [Government of Canada](https://wateroffice.ec.gc.ca/mainmenu/real_time_data_index_e.html).

In [None]:
extreme_data = pd.read_csv("https://raw.githubusercontent.com/callysto/data-files/main/Science/WaterLevels/annual_extremes.csv", skiprows=1)
extreme_data.head()

The dataset *extreme_data* obtained above contains the extremes of Shuswap Lake, that is, the `minimum` and `maximum` water levels recorded in each year.

In [None]:
shuswap_annual = pd.read_csv("https://raw.githubusercontent.com/callysto/data-files/main/Science/WaterLevels/shuswap_data.csv", skiprows=1)
shuswap_annual.head()

The dataset *shuswap_annual* obtained above contains the water level of Shuswap Lake for each day of the year, for the years *1951* to *2021*.

### Data Cleaning

Data cleaning is like tidying up information to make it useful and accurate. Just like you clean and organize your room, data cleaning helps make data neat and organized.

Imagine you have a bunch of information about your classmates, like their names, ages, and favorite colors. But sometimes, mistakes or errors can happen when collecting this information. For example, someone may have misspelled a name or entered the wrong age for a classmate.

Data cleaning involves finding and fixing these mistakes. It's important because clean data helps us make better decisions and find meaningful patterns.

In this particular situation, we want to remove the columns `SYM` and `SYM.1`. These columns are mostly composed of **NaN** values, meaning that most rows contain no value for them. Columns that are almost entirely empty tell us very little about water levels and would most likely lead to little insight.

In [None]:
# Drop the columns SYM, SYM.1
extreme_data = extreme_data.drop(columns=['SYM', 'SYM.1']) 
shuswap_annual = shuswap_annual.drop(columns=['SYM'])
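Before dropping a column, it can help to check how empty it really is. The sketch below does this on a small made-up table; the values and the 0.5 cut-off are assumptions chosen for illustration, not part of the real dataset.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only; the column names mirror the ones used above.
example = pd.DataFrame({
    "YEAR": [2020, 2020, 2021, 2021],
    "Value": [345.1, 345.3, 344.9, 345.0],
    "SYM": [np.nan, np.nan, "E", np.nan],
})

# isna() marks each missing cell as True; the mean of those True/False values
# is the fraction of rows that are missing in each column.
missing_fraction = example.isna().mean()
print(missing_fraction)

# Keep only the columns where no more than half of the values are missing
# (the 0.5 threshold is an arbitrary choice for this example).
mostly_filled = example.loc[:, missing_fraction <= 0.5]
print(list(mostly_filled.columns))
```

The same `isna().mean()` check could be run on `extreme_data` or `shuswap_annual` to see which columns are mostly empty.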

Now that our dataset has been *cleaned*, let's find the highest and lowest extremes in it, that is, the year, month, and day when the highest and lowest water levels were recorded.

In [None]:
max_level = extreme_data.query('MAX == MAX.max()')
min_level = extreme_data.query('MIN == MIN.min()')

maxvals = max_level.to_numpy()
minvals = min_level.to_numpy()


print(f"The highest water-level recorded was {maxvals[0][4]} (meters) in {maxvals[0][2]}-{maxvals[0][3].replace('--', '-')}")
print(f"The lowest water-level recorded was {minvals[0][4]} (meters) in {minvals[0][2]}-{minvals[0][5].replace('--', '-')}")

Surprisingly, both the highest and lowest water levels were recorded in the late 1900s. The highest water level was recorded on May 14th, while the lowest was recorded on February 27th.

Can you think of any particular reasons why water-levels would be high during the month of May, and vice-versa, can you think of any reason why water-levels would be low during the month of February?
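The lookup of the highest and lowest records can also be sketched with column names instead of numbered positions, which tends to be easier to read. The table below uses made-up values, and only the column names `Year`, `MAX` and `MIN` are taken from the dataset above.

```python
import pandas as pd

# Made-up values for illustration only; they are not real Shuswap Lake records.
example_extremes = pd.DataFrame({
    "Year": [2001, 2002, 2003],
    "MAX": [348.9, 349.4, 348.7],
    "MIN": [345.2, 345.3, 345.0],
})

# idxmax()/idxmin() return the row label of the largest/smallest value,
# and .loc[...] then retrieves that whole row by its label.
highest = example_extremes.loc[example_extremes["MAX"].idxmax()]
lowest = example_extremes.loc[example_extremes["MIN"].idxmin()]

print(f"Highest level: {highest['MAX']} m in {int(highest['Year'])}")
print(f"Lowest level: {lowest['MIN']} m in {int(lowest['Year'])}")
```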

Now that we have the year, month, and day of the highest and lowest water levels recorded, let's visualize the highest and lowest water-level years alongside the extremes from 1951-2021 and see if any trends are apparent.

We should also define a function to convert the day values in our dataset from integers (1-366) to actual dates.

To clarify, a function is like a *specialized tool* that takes certain inputs and performs a specific task, producing a desired output or result. In this case, we're inputting our **day numbers** from 1-366 into the function, and it outputs an **actual date** that corresponds to that integer.

In [None]:
def convert_to_date(day_number):
    if day_number < 1 or day_number > 366:
        return "Invalid day number. Please enter a number between 1 and 366."

    days_in_month = [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]
    month_names = ["January", "February", "March", "April", "May", "June",
                   "July", "August", "September", "October", "November", "December"]

    if day_number == 366:
        return "December 31"

    date = None
    month_index = 0

    for days_in_current_month in days_in_month:
        if day_number <= days_in_current_month:
            date = day_number
            break
        else:
            day_number -= days_in_current_month
            month_index += 1

    if date is not None:
        return f"{month_names[month_index]} {date}"
    else:
        return "Error: Could not determine the date."

day_number = 365
result = convert_to_date(day_number)
print(f"Day number {day_number} is {result}.")
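The function above treats every year as having 28 days in February and maps day 366 to December 31. As an alternative sketch, Python's built-in `datetime` module can do the conversion while accounting for leap years, provided we also pass in the year. The function name `day_of_year_to_date` is a hypothetical one chosen for this example.

```python
from datetime import date, timedelta

MONTH_NAMES = ["January", "February", "March", "April", "May", "June",
               "July", "August", "September", "October", "November", "December"]

def day_of_year_to_date(day_number, year):
    """Convert a day-of-year number (1 = January 1) into text such as 'May 14'."""
    # Start at January 1 of the given year and step forward day_number - 1 days.
    result = date(year, 1, 1) + timedelta(days=day_number - 1)
    # Reject numbers that land outside the requested year.
    if day_number < 1 or result.year != year:
        raise ValueError(f"{day_number} is not a valid day number for {year}")
    return f"{MONTH_NAMES[result.month - 1]} {result.day}"

print(day_of_year_to_date(60, 2020))  # 2020 was a leap year
print(day_of_year_to_date(60, 2021))  # 2021 was not
```

With this version, day 60 falls on February 29 in a leap year and on March 1 otherwise.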

In [None]:
pd.options.mode.chained_assignment = None  

max_year = shuswap_annual.query(f'YEAR == {maxvals[0][2]}')
min_year = shuswap_annual.query(f'YEAR == {minvals[0][2]}')

# Create new column Date for our DD column and apply our function defined above
max_year['Date'] = max_year['DD'].apply(convert_to_date)
min_year['Date'] = min_year['DD'].apply(convert_to_date)

queried_fig = make_subplots(rows=1, cols=2, shared_yaxes=True, subplot_titles=("Highest Water Level Year (1972)", "Lowest Water Level Year (1980)"), x_title="Days", y_title="Water Level (m)")

queried_fig.add_trace(
    go.Scatter(x=max_year['Date'], y=max_year['Value']),
    row=1, col=1
)

queried_fig.add_trace(
    go.Scatter(x=min_year['Date'], y=min_year['Value']),
    row=1, col=2
)

queried_fig.update_layout(showlegend=False)
queried_fig.show()

Looking at both plots, each year clearly contains its respective record maximum or minimum. It's also interesting that in the year with the lowest recorded level, water levels were generally lower throughout the year.

Comparing the peaks of the lowest and highest water-level years, the gap is significant. What do you think happened in 1980 at Lake Shuswap that could have caused these low water levels?

In [None]:
max_data = extreme_data.dropna(subset=['MAX'])
min_data = extreme_data.dropna(subset=['MIN'])

max_min_fig = make_subplots(rows=1, cols=2, shared_yaxes=True, subplot_titles=("Highest Water Levels", "Lowest Water Levels"), x_title="Year", y_title="Water Level (m)")

max_min_fig.add_trace(
    go.Scatter(x=max_data['Year'], y=max_data['MAX']),
    row=1, col=1
)

max_min_fig.add_trace(
    go.Scatter(x=min_data['Year'], y=min_data['MIN']),
    row=1, col=2
)
max_min_fig.update_layout(showlegend=False)
max_min_fig.show()

Surprisingly, although global sea level has risen by approximately 21-24 centimeters over the past 150 years, Lake Shuswap's minimum and maximum extremes appear not to have changed much over the 70 years of records.

To take a deeper look, let's plot the daily water levels for every year and see if any differences between years become apparent.

In [None]:
# Split the years into three periods (roughly one third of the record each)
half_way1 = shuswap_annual.query('YEAR <= 1974')
half_way2 = shuswap_annual.query('`YEAR` >= 1975 and `YEAR` <= 1997')
half_way3 = shuswap_annual.query('`YEAR` >= 1998 and `YEAR` <= 2021')
# Convert DD to Date
half_way2['Date'] = half_way2['DD'].apply(convert_to_date)
half_way3['Date'] = half_way3['DD'].apply(convert_to_date)

Note: For the figure `Shuswap Lake Water-Levels from 1951 to 1974`, the dates are shown as day numbers from 1-365, where 1 is **January 1st** and 365 is **December 31st**.

In [None]:
# Build each figure first, then show it (calling .show() returns None, so it is kept separate)
half_way_fig1 = px.scatter(half_way1, x='DD', y='Value', color='YEAR', labels={'Value': 'Water Level (m)'}, title="Shuswap Lake Water-Levels from 1951 to 1974")
half_way_fig1.show()
half_way_fig2 = px.scatter(half_way2, x='Date', y='Value', color='YEAR', labels={'Value': 'Water Level (m)'}, title="Shuswap Lake Water-Levels from 1975 to 1997")
half_way_fig2.show()
half_way_fig3 = px.scatter(half_way3, x='Date', y='Value', color='YEAR', labels={'Value': 'Water Level (m)'}, title="Shuswap Lake Water-Levels from 1998 to 2021")
half_way_fig3.show()

Looking at all years, there still don't appear to be any major differences that would lead us to conclude that Lake Shuswap's water levels are increasing or decreasing.

This may come as a surprise, as global media has highlighted the dangers of rising sea levels, but changes in a local area may not reflect changes seen on a global scale. This isn't to dismiss sea-level rise as fictitious; rather, the conditions surrounding a particular area may make it less susceptible to change than other bodies of water.
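One way to put a number on "no major differences" is to fit a straight line through the yearly values and look at its slope. The sketch below uses made-up yearly maximums (not real measurements) and `np.polyfit` with degree 1; it is a rough indicator under those assumptions, not a formal statistical test.

```python
import numpy as np

# Made-up yearly maximum levels for illustration only (not real measurements).
years = np.array([2000, 2001, 2002, 2003, 2004, 2005])
yearly_max = np.array([348.9, 349.3, 348.6, 349.1, 348.8, 349.2])

# Fit a straight line (a degree-1 polynomial). The first coefficient is the slope,
# which is the average change in metres per year.
slope, intercept = np.polyfit(years, yearly_max, 1)
print(f"Estimated trend: {slope:.3f} m per year")
```

A slope that is small compared with the year-to-year ups and downs suggests there is no clear upward or downward trend. The same idea could be tried on the `Year` and `MAX` columns of `extreme_data`.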

### Map of Stations

We can also compare water stations throughout Canada using real-time hydrometric data provided by the [Government of Canada](https://wateroffice.ec.gc.ca/map/index_e.html). 

Using the map provided below, find a particular water station you're interested in and take note of the **station number**.

In [None]:
df = pd.read_csv('https://wateroffice.ec.gc.ca/map/download_e.html?type=real_time&filters=%7B%22station_id%22%3A%22%22%2C%22station_name%22%3A%22%22%2C%22province%22%3A%22all%22%2C%22region%22%3A%22CAN%22%2C%22basin%22%3A%22all%22%2C%22parameter%22%3A%22all%22%2C%22operation_schedule%22%3A%22all%22%2C%22operating_agency%22%3A%22all%22%7D')
latitude = df['Latitude'].mean()
longitude = df['Longitude'].mean()
station_map = folium.Map(location=[latitude,longitude], zoom_start=3)
marker_cluster = MarkerCluster()
for row in df.iterrows():
    marker_cluster.add_child(folium.Marker(location=[row[1]['Latitude'],row[1]['Longitude']], popup=[row[1]['Station Name'], row[1]['Station ID']]))
station_map.add_child(marker_cluster)
station_map

In the cell below, change the variables `station_map_num1` and `station_map_num2` to the station numbers of the water stations you want to explore.

In [None]:
station_map_num1 = '07ED001'
station_map_num2 = '05BH004'


row1 = df[df['Station ID']==station_map_num1]['Province'].index[0]
row2 = df[df['Station ID']==station_map_num2]['Province'].index[0]

province1 = df['Province'].loc[row1]
province2 = df['Province'].loc[row2]
name1 = df['Station Name'].loc[row1]
name2 = df['Station Name'].loc[row2]

otherwater_data1 = pd.read_csv(f'https://dd.weather.gc.ca/hydrometric/csv/{province1}/daily/{province1}_{station_map_num1}_daily_hydrometric.csv')
otherwater_data2 = pd.read_csv(f'https://dd.weather.gc.ca/hydrometric/csv/{province2}/daily/{province2}_{station_map_num2}_daily_hydrometric.csv')
compare_fig = make_subplots(rows=1, cols=2, subplot_titles=(name1, name2), x_title="Date", y_title="Water Level (m)")
compare_fig.add_trace(
    go.Scatter(x=otherwater_data1['Date'], y=otherwater_data1["Water Level / Niveau d'eau (m)"]),
    row=1, col=1
)
compare_fig.add_trace(
    go.Scatter(x=otherwater_data2['Date'], y=otherwater_data2["Water Level / Niveau d'eau (m)"]),
    row=1, col=2
)
compare_fig.update_layout(showlegend=False)
compare_fig.show()
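If a station number is mistyped, the lookup with `.index[0]` above stops with a fairly cryptic error. A minimal sketch of a friendlier lookup is shown below; `find_station` is a hypothetical helper, and the station rows are made up, with only the column names taken from the downloaded table.

```python
import pandas as pd

# Made-up station list for illustration; the column names match the download above.
example_stations = pd.DataFrame({
    "Station ID": ["00AA001", "00BB002"],
    "Station Name": ["Example River at Sampletown", "Example Lake near Testville"],
    "Province": ["BC", "AB"],
})

def find_station(stations, station_id):
    """Return (province, name) for a station ID, or raise a readable error."""
    match = stations[stations["Station ID"] == station_id]
    if match.empty:
        raise ValueError(f"Station {station_id} was not found. Check the ID on the map.")
    first = match.iloc[0]
    return first["Province"], first["Station Name"]

province, name = find_station(example_stations, "00BB002")
print(province, name)
```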

### Questions:

1. How do the plots of water levels between the two bodies of water differ? Similarly, how are the bodies of water similar? Are there any noticeable patterns or trends that stand out?
2. What factors might contribute to the differences in water level plots between the two bodies of water? Consider geographical, environmental, or human-related factors.
3. Reflect on the limitations and uncertainties in comparing water level plots between two bodies of water. What factors might introduce biases or errors in the comparison, and how can we account for these limitations?

### Predictions on Water-Levels: Machine Learning

Now that we've done analysis on the extremities within Lake Shuswap, let's see if we can *predict* what future water levels could potentially look like.

To predict future water levels, we utilize what is known as **machine learning**. Machine learning is a field that focuses on teaching computers to learn and make decisions without being explicitly programmed for each task. It's like training a computer to think and make *predictions*, just like humans do.

[![ML video](https://img.youtube.com/vi/f_uwKZIAeM0/0.jpg)](https://www.youtube.com/watch?v=f_uwKZIAeM0)



Let's select the rows for the year **2021** using the `YEAR` column. This gives us every daily value recorded that year, so our model has a variety of data points to learn from.

In [None]:
sect_shuswap_annual = shuswap_annual.query('YEAR == 2021')
display(sect_shuswap_annual)

Having many data points in machine learning is like having a lot of examples to learn from. Imagine you're trying to learn how to identify different types of fruits. 

If you only have a *few* fruits to practice with, like an apple and a banana, it might be challenging to recognize other fruits like oranges or strawberries. But if you have a big basket filled with many different fruits, you can learn to recognize a *wider* variety. Furthermore, if you supply the basket with apples that are slightly different from one another, you could become more familiar with *distinguishing* what makes a fruit an apple or not due to knowing these slight differences.

Now that we have the data points that we want to train our model on, let's see what visualization the model outputs. 

In [None]:
X_data = sect_shuswap_annual['DD'].to_numpy()
y_data = sect_shuswap_annual["Value"].to_numpy()

X_train, X_test, y_train, y_test = train_test_split(X_data, y_data, test_size=0.20)
mymodel = np.poly1d(np.polyfit(X_train, y_train, 8))
myline = np.linspace(min(X_train), max(X_train), 100)

plt.scatter(X_train, y_train)
plt.plot(myline, mymodel(myline))
plt.xlabel("Date")
plt.ylabel("Water Level (m)")
plt.show()

Looking above, the curve follows the data points fairly well along the x (`Date`) axis. The curve is slightly off at certain points, but this is not necessarily a problem: a curve that bent to pass through every single training point would be **overfitting**.

In machine learning, overfitting happens when a model gets too focused on the specific details of the training data, and it fails to generalize well to new, unseen data. In other words, the model becomes too *specialized* in the training data and doesn't understand the broader patterns and concepts that are necessary for making accurate predictions.
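To see this idea in action, the sketch below fits a simpler curve (degree 3) and a more flexible curve (degree 9) to the same small set of made-up points, then measures the error on the training points and on separate test points. The data, the degrees, and the noise level are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the made-up data is repeatable

def noisy_levels(x):
    # A smooth seasonal-looking curve plus random noise (invented data).
    return np.sin(np.pi * x) + rng.normal(0, 0.2, size=x.shape)

x_train = np.linspace(-1, 1, 12)
x_test = np.linspace(-0.95, 0.95, 50)
y_train = noisy_levels(x_train)
y_test = noisy_levels(x_test)

def mean_squared_error(model, x, y):
    # Average of the squared gaps between the curve and the points.
    return float(np.mean((model(x) - y) ** 2))

errors = {}
for degree in (3, 9):
    model = np.poly1d(np.polyfit(x_train, y_train, degree))
    errors[degree] = (mean_squared_error(model, x_train, y_train),
                      mean_squared_error(model, x_test, y_test))
    print(f"degree {degree}: train error {errors[degree][0]:.4f}, "
          f"test error {errors[degree][1]:.4f}")
```

The more flexible curve always matches the training points at least as closely as the simpler one. Overfitting shows up when its error on the test points is noticeably larger than its error on the training points, so compare the two numbers printed for each degree.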

In [None]:
from sklearn.metrics import r2_score

error = r2_score(y_test, mymodel(X_test))
print(f"The R-Squared score for this model is {error}")

Looking above, **R-Squared** is a score that evaluates how well our curve fits the data points, here the test points that the model did not see during training. A score of 1 means the curve passes through every data point exactly, while a score of 0 means the curve does no better than always predicting the average water level. The score can even be negative if the curve does worse than that. Generally, a score of 0.90 and above indicates a *very good* fit of the model to our data.
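R-squared can also be worked out by hand: it is 1 minus the sum of squared prediction errors divided by the total variation of the actual values around their average. The sketch below follows that formula on a few made-up numbers; `r_squared` is a hypothetical helper written for this example, not the scikit-learn function.

```python
def r_squared(actual, predicted):
    """R-squared = 1 - (sum of squared errors) / (total variation around the mean)."""
    mean_actual = sum(actual) / len(actual)
    squared_errors = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    total_variation = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - squared_errors / total_variation

levels = [1.0, 2.0, 3.0, 4.0]  # made-up values

perfect = r_squared(levels, levels)                     # predictions match exactly
mean_only = r_squared(levels, [2.5, 2.5, 2.5, 2.5])     # always predict the average
worse = r_squared(levels, [4.0, 3.0, 2.0, 1.0])         # predictions in reverse order

print(perfect, mean_only, worse)  # prints 1.0 0.0 -3.0
```

The three cases show a perfect score, a score of zero for always guessing the average, and a negative score for predictions that are worse than guessing the average.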

Now, let's input some potential predicted values based on our model. In the code cell below, alter the value of `prediction` which is currently set to **1** and see what the model predicts.

In [None]:
# Input a prediction from 1-365
# For example, instead of setting prediction = 1, you could input prediction = 155
prediction = 1 
print(f"The predicted value for a date of {prediction} is {mymodel(prediction)}")

Using our new model, let's obtain all 365 days of a year and output a figure of what our model would predict certain water levels would be based on the training data we gave it. 

Let's also define another conversion function for our integer dates in order to visualize the plot clearly.

In [None]:
def convert_list_to_dates(day_numbers):
    if any(day_number < 1 or day_number > 366 for day_number in day_numbers):
        return "Invalid day number. Please enter numbers between 1 and 366."

    days_in_month = [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]
    month_names = ["January", "February", "March", "April", "May", "June",
                   "July", "August", "September", "October", "November", "December"]

    dates = []

    for day_number in day_numbers:
        date = None
        month_index = 0

        for days_in_current_month in days_in_month:
            if day_number <= days_in_current_month:
                date = day_number
                break
            else:
                day_number -= days_in_current_month
                month_index += 1

        if date is not None:
            dates.append(f"{month_names[month_index]} {date}")
        else:
            dates.append("Error: Could not determine the date.")

    return dates

In [None]:
predicted_values = []
predicted_dates = []
for i in range(1, 366):
    predicted_values.append(mymodel(i))
    predicted_dates.append(i)
    
# Convert our integer dates to real dates defined from our function above
predicted_dates_converted = convert_list_to_dates(predicted_dates)
X_data_converted = convert_list_to_dates(X_data)

fig = make_subplots(rows=1, cols=2, shared_xaxes=True, subplot_titles=("Actual Water Levels", "Predicted Water Levels"), x_title="Date", y_title="Water Level (m)")

fig.add_trace(
    go.Scatter(x=X_data_converted, y=y_data),
    row=1, col=1
)

fig.add_trace(
    go.Scatter(x=predicted_dates_converted, y=predicted_values),
    row=1, col=2
)
fig.update_layout(showlegend=False)
fig.show()

### Questions:

1. What are the potential benefits of using machine learning to analyze historical water level data of a lake? How could this information be valuable for managing water resources or planning activities around the lake?
2. How could understanding the patterns and trends in lake water levels through machine learning help in mitigating the impact of natural disasters on surrounding ecosystems and communities?
3. How could machine learning models that predict lake water levels be integrated into existing monitoring systems or early warning systems? What are the potential advantages and challenges of such integration?

# Conclusion

The Canadian Government provides information about bodies of water, known as [hydrometric data](https://wateroffice.ec.gc.ca/mainmenu/real_time_data_index_e.html). In this notebook, we imported this data, identified potential trends in water levels, and predicted water levels through machine learning.

Perhaps you can try extension activities such as applying machine learning to other fields, for example predicting a classmate's height or estimating the price of a house based on its size.

[![Callysto.ca License](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-bottom.jpg?raw=true)](https://github.com/callysto/curriculum-notebooks/blob/master/LICENSE.md)