<img src="images/kiksmeisedwengougent.png" alt="Banner" width="1100"/>

<div>
    <font color=#690027 markdown="1"> 
        <h1>SEA LEVEL IN OSTEND - REGRESSION</h1>
    </font>
</div>

<div class="alert alert-box alert-success">
In this notebook, we investigate whether the future sea level in Ostend can be predicted better by <em>regression with a parabola instead of a straight line, or better still with another curve</em>.<br>The functionalities of the Python module <em>SciPy</em> are used again. <br>The importance of <em>standardizing</em> is explained and the phenomenon of <em>overfitting</em> is illustrated.</div>

<div class="alert alert-box alert-warning">
This notebook follows the notebook 'Sea level in Ostend - Linear regression'.<br>The notebook 'Linear Regression' explains how to determine a regression line for given data using the SciPy module.</div>

Sea level is influenced by, among other things, the expansion of the water mass as the temperature rises, the melting of ice caps and glaciers, and changes in the storage of surface water and groundwater. Global climate change is expected to lead to a rise in sea level of 18 to 59 cm [1].<br>
**We examine the evolution of the sea level at the Belgian coast since 1951. Ostend is the measuring point on our coast with the longest uninterrupted series of measurements.**

The height of a point is measured relative to sea level. <br>The sea level, however, does not always remain at the same height. Tides cause a difference that amounts to about four meters at the Belgian coast. <br>Therefore, a reference point is needed. The average sea level at low tide in Ostend is used as the baseline: the Second General Levelling (TAW). In the Netherlands, the average sea level between low and high tide is taken: the Normal Amsterdam Level (NAP). The TAW reference point lies 2.33 meters below the NAP reference point. To compare national height measurements with each other, one must take into account the different reference points [2].<br><br>**Sea level is expressed in mm RLR (Revised Local Reference); data relative to the local reference are converted to data relative to the international reference level.**
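The relation between the two reference levels can be written as a small conversion. This is only an illustrative sketch: the function names `taw_to_nap` and `nap_to_taw` are hypothetical, and the only figure used is the 2.33 m offset mentioned above.

```python
# Offset between the Belgian (TAW) and Dutch (NAP) reference levels, in meters.
# The TAW zero level lies 2.33 m below the NAP zero level (value from the text).
TAW_BELOW_NAP_M = 2.33

def taw_to_nap(height_taw_m):
    """Convert a height in m TAW to m NAP (hypothetical helper)."""
    # The NAP zero lies higher, so the same point has a smaller height in NAP.
    return height_taw_m - TAW_BELOW_NAP_M

def nap_to_taw(height_nap_m):
    """Convert a height in m NAP to m TAW (hypothetical helper)."""
    return height_nap_m + TAW_BELOW_NAP_M

# A point at 5 m TAW expressed relative to NAP
print(taw_to_nap(5.0))
```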

The sea level in Ostend has been measured since 1951. The values from these measurements can be found on the website of the Flemish Environment Agency [3]. <br><br>For this notebook, the data are available in the file `sealevel.csv` in the `data` folder. <br>For each year, the file contains the year itself and the annual average sea level (in mm RLR) in Ostend.

### Assignment 1
- Visualize the data from the csv file (see also the previous notebook).<br>Set the range of the axes in such a way that there is space for a 'glimpse into the future'.
- The scatter plot shows a trend. Determine the equation of a parabola as a trend line.
- Draw the parabola on the graph.
- Do you find the parabola suitable as a trend line?
- The annual average sea levels measured in 2018 and 2019 are 7067 and 7129 mm RLR, respectively [4]. Add these data points to the graph.
- Is your answer to 'Do you find the parabola suitable as a trend line?' still the same?

### Assignment 2
- Determine a cubic polynomial and a tenth-degree polynomial as trend lines.
- Draw them on the graph.
- Are these curves suitable as a trend line?

## Solution

### Import Required Modules
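The code cells further down call `np`, `plt` and `curve_fit`, so the import cell could look like the sketch below. How the csv file is read is not shown in this notebook, so any module needed for that step is left out here.

```python
import numpy as np                      # arrays, mean and standard deviation
import matplotlib.pyplot as plt         # scatter plot and trend curves
from scipy.optimize import curve_fit    # least-squares fit of a function to data
```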

## Solution to assignment 1

<div>
    <font color=#690027 markdown="1"> 
        <h2>1. Reading the data</h2> 
    </font>
</div>

<div>
    <font color=#690027 markdown="1"> 
        <h2>2. Point Cloud</h2>
    </font>
</div>

<div>
    <font color=#690027 markdown="1"> 
        <h2>3. Quadratic Regression</h2> 
    </font>
</div>

Look carefully at what you did to obtain a straight line as a regression line and adjust that code for a parabola.
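As a hint, adapting the straight-line approach to a parabola could look like the sketch below. It assumes that the previous notebook used `curve_fit()` with a hand-written function. The names `parabola` and `kwadratischereg` are hypothetical, and the data are synthetic points generated around a known parabola, not the Ostend measurements.

```python
import numpy as np
from scipy.optimize import curve_fit

def parabola(x, a, b, c):
    """Quadratic function with parameters a, b and c."""
    return a * x**2 + b * x + c

def kwadratischereg(x, y):
    """Fit a parabola to the points (x, y) and return its coefficients."""
    # popt holds the fitted parameters, pcov their covariance matrix
    popt, pcov = curve_fit(parabola, x, y)
    a, b, c = popt
    return a, b, c

# synthetic data (NOT the Ostend measurements): points near y = 0.5 x² - x + 2
rng = np.random.default_rng(0)
x_demo = np.linspace(-3, 3, 50)
y_demo = parabola(x_demo, 0.5, -1.0, 2.0) + rng.normal(0, 0.1, x_demo.size)

a, b, c = kwadratischereg(x_demo, y_demo)
print("y =", a, "x² +", b, "x +", c)
```

The fitted coefficients should lie close to the values 0.5, -1 and 2 that were used to generate the synthetic points.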

## Solution to assignment 2: Another curve as regression line

<div>
    <font color=#690027 markdown="1"> 
        <h2>4. Third-Degree Curve as Regression Line</h2>
    </font>
</div>

If we apply the same method as for the parabola, `curve_fit()` finds a curve whose graph falls outside the plotted range. Try it yourself! <br>The reason is that the calculations become numerically unstable for the large values in the dataset: the years are around 2000, so their third powers run into the billions. Therefore, the data are standardized.
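One way to see why large input values cause trouble is to look at the condition number of the matrix of powers of x that a least-squares fit of a cubic works with. The sketch below assumes that the years run from 1951 to 2017 (the exact range in `sealevel.csv` is an assumption). It only illustrates the effect and is not necessarily what `curve_fit()` computes internally.

```python
import numpy as np

# assumed range of years (the exact content of sealevel.csv is not shown here)
years = np.arange(1951, 2018, dtype=float)
years_std = (years - np.mean(years)) / np.std(years)

# matrices with columns x³, x², x and 1
V_raw = np.vander(years, 4)
V_std = np.vander(years_std, 4)

# a large condition number means that small rounding errors
# can change the fitted coefficients a lot
cond_raw = np.linalg.cond(V_raw)
cond_std = np.linalg.cond(V_std)

print("condition number, raw years:         ", cond_raw)
print("condition number, standardized years:", cond_std)
```

The condition number for the raw years is many orders of magnitude larger than for the standardized years, which is why the fit on the raw data is so sensitive.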

<div class="alert alert-box alert-warning">
Review the notebook 'Standardization' from the learning path 'Linear regression' of 'Python in Mathematics class'.</div>

The data are **standardized** as follows: the mean of the training data is subtracted from each data point in the training data, and the result is then divided by the standard deviation of the training data. In other words, the Z-score is calculated for all the training data. <br> This way, the values are centered around 0 and most of them fall between -1 and 1. <br>Note that the full dataset is standardized in the same way. So, one does exactly the same with the test data (here the measurements of 2018 and 2019): one also uses the mean and standard deviation of the **training data**.
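The point about reusing the mean and standard deviation of the training data can be sketched as follows. The arrays below are synthetic stand-ins (not the values from `sealevel.csv`), and the helper name `standardize` is hypothetical.

```python
import numpy as np

def standardize(values, mean, std):
    """Return the Z-scores of values for a given mean and standard deviation."""
    return (values - mean) / std

# synthetic training data (NOT the Ostend measurements)
x_train = np.arange(1951, 2018, dtype=float)
y_train = 6900.0 + 2.0 * (x_train - 1951)

x_mean, x_sd = np.mean(x_train), np.std(x_train)
y_mean, y_sd = np.mean(y_train), np.std(y_train)

x_train_std = standardize(x_train, x_mean, x_sd)

# test data: use the mean and standard deviation of the TRAINING data
x_test = np.array([2018.0, 2019.0])
x_test_std = standardize(x_test, x_mean, x_sd)
print(x_test_std)

# a standardized prediction can be converted back to the original unit
y_pred_std = 1.5                      # arbitrary illustrative value
y_pred = y_pred_std * y_sd + y_mean
print(y_pred)
```

After standardizing, the training data have mean 0 and standard deviation 1, while the test points simply land a little to the right of the largest training value.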

<div class="alert alert-box alert-info">
Calculating with fairly large numbers quickly leads to even larger numbers and to numerical instability, which is one of the reasons why data are standardized. Standardizing variables means rescaling them in such a way that variables of, for example, a different magnitude or a different unit can be compared or related to each other. The correlation between bivariate data, for example, can be estimated visually by looking at the corresponding scatter plot. However, the shape of the scatter plot is only reliable if the data are standardized. Also, some machine learning algorithms can only be used once the data are standardized, because those algorithms are designed that way.    </div>

In [None]:
# standardize
x_std = (x-np.mean(x))/np.std(x)
y_std = (y-np.mean(y))/np.std(y)

In [None]:
print(x_std, y_std)

Now a regression curve is determined that fits the standardized values.
Examine the code carefully. Do you understand what each instruction does?

In [None]:
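def derdegr(x, a, b, c, d):
    """Cubic polynomial function with parameters a, b, c, and d."""
    return a * x**3 + b * x**2 + c * x + d

def derdegraadsreg(x, y):
    """Fit a cubic polynomial to the points (x, y) and return its coefficients."""
    # curve_fit returns the fitted parameters (popt) and their covariance matrix (pcov)
    popt, pcov = curve_fit(derdegr, x, y)
    a, b, c, d = popt
    print("y = ", a, " x³  +", b, "x² +", c, "x +", d)
    return a, b, c, d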

In [None]:
a, b, c, d = derdegraadsreg(x_std, y_std)

In [None]:
x_std.min(), y_std.min(), x_std.max(), y_std.max()

In [None]:
x_std_regressielijn = np.arange(-2, 3, 0.1)
y3_std_regressielijn = derdegr(x_std_regressielijn, a, b, c, d)

In [None]:
print(x_std_regressielijn)
print(y3_std_regressielijn)

In [None]:
plt.figure(figsize=(15,12)) 

# choose the range so that there is room for a look into the future
plt.xlim(x_std.min()-2, x_std.max()+2)
plt.ylim(y_std.min()-2, y_std.max()+2)
plt.title("annual average sea level in Ostend")
plt.xlabel("year (standardized)")
plt.ylabel("sea level (standardized)")

plt.scatter(x_std, y_std, color="blue", marker="o")
plt.plot(x_std_regressielijn, y3_std_regressielijn, color="orange")
plt.plot((2018-np.mean(x))/np.std(x), (7067-np.mean(y))/np.std(y), color="magenta", marker="o")
plt.plot((2019-np.mean(x))/np.std(x), (7129-np.mean(y))/np.std(y), color="magenta", marker="o")

plt.show()

Answer:

<div>
    <font color=#690027 markdown="1"> 
        <h2>5. Tenth-Degree Curve as Regression Line</h2> 
    </font>
</div>

In [None]:
def tiendegr(x, a, b, c, d, e, f, g, h, i, j, k):
    """Polynomial function of tenth degree with parameters a through k."""
    return a * x**10 + b * x**9 + c * x**8 + d * x**7 + e * x**6 + f * x**5 + g * x**4 + h * x**3 + i * x**2 + j * x + k

def tiendegraadsreg(x, y):
    popt, pcov = curve_fit(tiendegr, x, y)
    a, b, c, d, e, f, g, h, i, j, k = popt
    return a, b, c, d, e, f, g, h, i, j, k

Fill in the rest of the code yourself. Base your work on the code for a cubic curve and adjust that code.

Answer:

<div class="alert alert-box alert-info">
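A curve that follows the given data very closely but predicts new data poorly is said to <b>overfit</b>. <br>The more factors one takes into account (here: the parameters of the polynomial), the better the curve will fit the given data. However, overfitting occurs when one also takes into account characteristics of the data that are not relevant to the problem to be solved, such as random fluctuations.    </div>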
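The effect can be made visible with a small experiment. This is a sketch on synthetic data (a straight line plus random noise, not the Ostend measurements), and it uses `np.polyfit` instead of `curve_fit()` for brevity. A tenth-degree polynomial fits the training points at least as well as a straight line, yet it tends to stray much further once it is evaluated outside the range of the data.

```python
import numpy as np

rng = np.random.default_rng(1)

# synthetic, already standardized data: a linear trend plus noise
x_train = np.linspace(-1.7, 1.7, 67)
y_train = 0.8 * x_train + rng.normal(0, 0.5, x_train.size)

def rmse(coeffs, x, y):
    """Root-mean-square error of the polynomial with the given coefficients."""
    return np.sqrt(np.mean((np.polyval(coeffs, x) - y) ** 2))

# least-squares fits of degree 1 and degree 10
coeffs_1 = np.polyfit(x_train, y_train, 1)
coeffs_10 = np.polyfit(x_train, y_train, 10)

# error on the training data
rmse_train_1 = rmse(coeffs_1, x_train, y_train)
rmse_train_10 = rmse(coeffs_10, x_train, y_train)

# error on 'future' points just outside the training range (noise-free trend)
x_future = np.linspace(1.8, 2.2, 5)
y_future = 0.8 * x_future
rmse_future_1 = rmse(coeffs_1, x_future, y_future)
rmse_future_10 = rmse(coeffs_10, x_future, y_future)

print("training error: degree 1:", rmse_train_1, " degree 10:", rmse_train_10)
print("future error:   degree 1:", rmse_future_1, " degree 10:", rmse_future_10)
```

Comparing the two printed lines shows the trade-off: a lower error on the training data does not mean a lower error on new data.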

<div>
    <h2>Reference List</h2> 
</div>

[1] Flemish Environment Agency (2019). Climate change. Accessed on January 21, 2020 via <br> &nbsp; &nbsp; &nbsp; &nbsp; https://www.milieurapport.be/milieuthemas/klimaatverandering<br>[2] Frank Deboosere (2010). With respect to which reference point are altitude measurements for maps made?<br> &nbsp; &nbsp; &nbsp; &nbsp; Accessed on January 21, 2020 via https://www.frankdeboosere.be/vragen/vraag72.php <br>[3] Flemish Environment Agency (2019). Sea level. Accessed on January 21, 2020 via <br> &nbsp; &nbsp; &nbsp; &nbsp; https://www.milieurapport.be/milieuthemas/klimaatverandering/zeeklimaat/zeeniveau/zeeniveau <br>[4] Flemish Environment Agency (2021). Sea level. Accessed on November 12, 2021 via <br> &nbsp; &nbsp; &nbsp; &nbsp; https://www.milieurapport.be/milieuthemas/klimaatverandering/zeeklimaat/zeeniveau

<div>
    <h2>With support from</h2></div>

<img src="images/kikssteun.png" alt="Banner" width="800"/>

<img src="images/cclic.png" alt="Banner" align="left" width="100"/><br><br>
Notebook KIKS, see <a href="http://www.aiopschool.be">AI At School</a>, by F. wyffels & N. Gesquière, is licensed under a <a href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>.