<img src="images/kiksmeisedwengougent.png" alt="Banner" width="1100">

<div>
    <font color=#690027 markdown="1">   
<h1>HEIGHT OF TREES AND DIMENSIONS OF STOMATA IN THE AMAZON RAINFOREST</h1>    </font>
</div>

<div class="alert alert-box alert-success">
Researchers from Brazil used linear regression to investigate whether there is a relationship between the number of stomata on leaves in the canopy of a tree and the height of the tree. In the previous notebook, you worked with their data and applied linear regression to it yourself. <br>In this notebook, you will add some elements of machine learning to that method. In machine learning, computer scientists reserve a part of the data to assess the quality of the model. Just like them, you will work with training data and testing data in this notebook.</div>

Paleoclimatologists have shown that there is a relationship between the number and size of stomata on leaves and the CO<sub>2</sub> content in the atmosphere when these plants grew.<br>Today, scientists around the world are researching the stomata on today's leaves. <br> In some plants, they discovered differences in the stomata of leaves sprouted in spring versus those in summer. In other plants, they found differences between leaves in the crown of a plant and the shaded leaves at the bottom of the plant.<br>It is certain that the number and size of the stomata are subject to environmental factors.<br> <br>The researchers Camargo and Marenco from Brazil wondered the following:<br>**Is there a correlation between the number of stomata on leaves in the crown of a tree and the height of the tree?**<br>To investigate this, they used data collected in the Amazon rainforest [1].

<div class="alert alert-box alert-warning">
You will notice that the beginning of this notebook is identical to the previous notebook <em>StomataHoogteBomen</em>. From section 4 onwards, however, the approach changes.</div>

### Import necessary modules

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

from scipy.optimize import curve_fit    # for regression
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error

<div>
    <font color=#690027 markdown="1">   
<h2>1. Reading the data</h2>    </font>
</div>

In [None]:
amazone = pd.read_csv("data/amazone.csv")

<div>
    <font color=#690027 markdown="1">   
<h2>2. Displaying the read data</h2>    </font>
</div>

In [None]:
# display dataset in table
amazone

<div>
    <font color=#690027 markdown="1">   
<h2>3. Investigating the linear association between the data via the correlation coefficient</h2>    </font>
</div>

<div class="alert alert-block alert-warning">
More explanation about the correlation coefficient can be found in the notebook 'Standardize'.</div>

Consider the two features you need.

You will work here with the data stored in `x1` and `x3` (`x1` refers to the stomatal densities and `x3` to the heights of the trees). In the graphic representation, `x3` comes on the y-axis and `x1` on the x-axis.

In [None]:
x1 = amazone["stomatal density"]
x3 = amazone["tree height"]

Determine the correlation coefficient R for the height of the tree and the density of the stomata. Is there a strong, moderate or weak linear relationship between these characteristics?

In [None]:
# to what extent is there a linear relationship between tree height and stomatal density?
np.corrcoef(x3, x1)[0,1]

There is a weak linear relationship between the height of the tree and the stomatal density!
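
To make clear what `np.corrcoef` calculates, here is a minimal sketch that computes the correlation coefficient from its definition and compares it with the NumPy result. The numbers and the names `height` and `density` are made up for this illustration; they are not the Amazon measurements.

```python
import numpy as np

# hypothetical example values, NOT the measurements from the Amazon dataset
height = np.array([10.0, 15.0, 20.0, 25.0, 30.0, 35.0])
density = np.array([300.0, 420.0, 380.0, 500.0, 450.0, 560.0])

# deviations of every value from the mean of its feature
dev_h = height - height.mean()
dev_d = density - density.mean()

# correlation coefficient: sum of the products of the deviations,
# divided by the square root of the product of the sums of squared deviations
r_manual = np.sum(dev_h * dev_d) / np.sqrt(np.sum(dev_h**2) * np.sum(dev_d**2))

# the same coefficient as returned by NumPy
r_numpy = np.corrcoef(height, density)[0, 1]

print(r_manual, r_numpy)
```

Both printed values should agree up to rounding, and R always lies between -1 and 1.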

<div>
    <font color=#690027 markdown="1">   
<h2>4. Regression line for correlation between stomatal density and tree height</h2>    </font>
</div>

In machine learning, computer scientists reserve a part of the data to check the quality of the model.

<div class="alert alert-box alert-info">
In machine learning, the following approach is taken: the data is split into training data and test data.<br> <em>The training data is used to construct a mathematical model.<br> The test data is used to check whether the model performs well on new data.</em><br>To see how well the model performs, the average squared deviation is calculated.</div>

<div>
    <font color=#690027 markdown="1">   
<h3>4.1 Training data and test data</h3>    </font>
</div>

You will work with **training data and test data**. The training data and test data are standardized separately.

In [None]:
# original data
x1 = np.array(x1)
x3 = np.array(x3)

# training data
x_train = x3[0:29]
y_train = x1[0:29]

# test data
x_test = x3[29:]
y_test = x1[29:]
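
The slices `[0:29]` and `[29:]` together cover every value exactly once. A small sketch on a hypothetical array of 35 values (the real dataset may have a different length) shows which indices end up where:

```python
import numpy as np

data = np.arange(35)        # hypothetical dataset with the values 0, 1, ..., 34

train = data[0:29]          # first 29 values (indices 0 up to and including 28)
test = data[29:]            # all remaining values, starting at index 29

print(len(train), len(test))    # number of training and test values
print(train[-1], test[0])       # last training value and first test value
```

Note that this split simply follows the order of the rows in the file; no shuffling takes place.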

<div>
    <font color=#690027 markdown="1">   
<h3>4.2 Standardize</h3>    </font>
</div>

The training and test data are **standardized** as follows: the mean of the training data is subtracted from each value in the training data, and the result is then divided by the standard deviation of the training data. <br>The full dataset is standardized in the same way, so the test data are also transformed with the mean and standard deviation of the **training data**.

In [None]:
# standardize
# determine average and standard deviation of training data
x_train_avg = np.mean(x_train)
x_train_std = np.std(x_train)
y_train_avg = np.mean(y_train)
y_train_std = np.std(y_train)

# standardize training data
x_train = (x_train - x_train_avg) / x_train_std
y_train = (y_train - y_train_avg) / y_train_std

# standardize test data (with average and standard deviation of the training data)
x_test = (x_test - x_train_avg) / x_train_std
y_test = (y_test - y_train_avg) / y_train_std
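
A quick check of what standardizing does, as a minimal sketch on made-up numbers (the arrays `train` and `test` are hypothetical): the standardized training data have mean 0 and standard deviation 1, while the test data, which are transformed with the training values, generally do not.

```python
import numpy as np

# hypothetical values, only for illustration
train = np.array([12.0, 18.0, 25.0, 31.0, 40.0])
test = np.array([15.0, 45.0])

# average and standard deviation of the training data only
train_avg = np.mean(train)
train_std = np.std(train)

# both sets are transformed with the values of the training data
train_s = (train - train_avg) / train_std
test_s = (test - train_avg) / train_std

print(np.mean(train_s), np.std(train_s))    # approximately 0 and 1
print(np.mean(test_s), np.std(test_s))      # generally not 0 and 1
```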

<div>
    <font color=#690027 markdown="1">   
<h3>4.3 Regression line</h3>    </font>
</div>

The functions needed to determine the regression line are still the same.

In [None]:
# regression line is straight

# input how the equation of a line is constructed
def straight_line(x, a, b):
    """Equation of an (oblique) straight line with variable x and coefficients a and b."""
    return a * x + b

# search for the line that best fits the given data, show its equation and return the coefficients
def linreg(x, y):
    """Best-fitting straight line for data x and y."""
    popt, pcov = curve_fit(straight_line, x, y)     # curve_fit() looks in straight_line() to see what the equation looks like
    # curve_fit() returns two things, referred to as popt and pcov
    # only the first one, popt, is needed; it gives a and b of the required straight line
    a, b = popt                                     # coefficients
    print("y = ", a, "x +", b)                      # display regression line equation
    return a, b                                     # return coefficients of regression line equation

The regression line is however only *fitted* on the training data. Afterwards, it is examined how well this regression line fits the test data.

In [None]:
# linear regression
# determine regression line based on training data
a, b = linreg(x_train, y_train)
y_regression = straight_line(x_train, a, b)

# average squared deviation relative to the training data
print("Average squared deviation for the line with respect to the training data: %.2f" % mean_squared_error(y_train, y_regression))

In [None]:
# regression line compared to test data
y_predicted = straight_line(x_test, a, b)

# average squared deviation relative to the test data (generalization)
print("Average squared deviation for the line with respect to the test data: %.2f" % mean_squared_error(y_test, y_predicted))
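
The average squared deviation is the mean of the squared differences between the observed values and the values predicted by the line; this is the quantity that `mean_squared_error` reports with its default settings. A minimal sketch with made-up numbers (the arrays `y_true` and `y_pred` are hypothetical):

```python
import numpy as np

# hypothetical observed values and hypothetical predictions of a line
y_true = np.array([0.5, -1.0, 1.5, 0.0])
y_pred = np.array([0.0, -0.5, 1.0, 1.0])

# difference per point, squared, then averaged over all points
mse_manual = np.mean((y_true - y_pred) ** 2)

print(mse_manual)
```

The smaller this number, the closer the points lie to the line on average.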

<div>
    <font color=#690027 markdown="1">   
<h3>4.4 Graph</h3>    </font>
</div>

In [None]:
# estimate range
x_train.min(), x_train.max(), x_test.min(), x_test.max(), y_train.min(), y_train.max(), y_test.min(), y_test.max()

In [None]:
# graphic representation
plt.figure(figsize=(10, 8))

plt.xlim(x_train.min()-2, x_train.max()+2)
plt.ylim(y_train.min()-2, y_test.max()+2)
plt.title("Amazon Rainforest")
plt.xlabel("tree height (standardized)")
plt.ylabel("stomatal density (standardized)")

# training data and regression line
plt.scatter(x_train, y_train, color="green", marker="o")
plt.plot(x_train, y_regression, color="red")

# test data
plt.scatter(x_test, y_test, color="blue", marker="o")
plt.show()

Interpretation: The average squared deviation for the line with respect to the training data is 0.93. The average squared deviation for the line with respect to the test data is 2.78. The error on the test data is larger, so the line does not generalize very well.<br>

In [None]:
# equation of the regression line (standardized data)
print("The equation of the line: y =", a, "x +", b)

In [None]:
# equation of the regression line without standardizing
print("The equation of the line: y =",
      a * y_train_std / x_train_std, "x +",
      b * y_train_std + y_train_avg - a * x_train_avg * y_train_std / x_train_std)
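
The formulas for the slope and intercept without standardizing follow from substituting the standardized variables back into the equation of the line. The sketch below checks this on made-up numbers. It assumes hypothetical arrays `x` and `y`, and it uses `np.polyfit` instead of `curve_fit` to keep the example short; that is a choice made for this illustration, not the method of the notebook.

```python
import numpy as np

# hypothetical values, NOT the measurements from the Amazon dataset
x = np.array([10.0, 14.0, 19.0, 23.0, 30.0, 36.0])
y = np.array([310.0, 350.0, 330.0, 420.0, 400.0, 470.0])

x_avg, x_std = np.mean(x), np.std(x)
y_avg, y_std = np.mean(y), np.std(y)

# straight line fitted to the standardized data
# (np.polyfit with degree 1 returns slope and intercept, in that order)
a, b = np.polyfit((x - x_avg) / x_std, (y - y_avg) / y_std, 1)

# coefficients converted back to the original units
slope = a * y_std / x_std
intercept = b * y_std + y_avg - a * x_avg * y_std / x_std

# straight line fitted directly to the original data
slope_direct, intercept_direct = np.polyfit(x, y, 1)

print(slope, slope_direct)
print(intercept, intercept_direct)
```

Both ways should give the same line, up to rounding.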

<div>
    <font color=#690027 markdown="1">   
<h2>5. Outlier</h2>    </font>
</div>

Note that one point of the test set can probably be considered an outlier. Check how well the line generalizes without this point.
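
One possible way to approach this, sketched on made-up numbers: look for the test point with the largest squared deviation, leave it out and recalculate the average squared deviation. The arrays `y_test_demo` and `y_pred_demo` are hypothetical; in the notebook you would use `y_test` and `y_predicted` instead.

```python
import numpy as np

# hypothetical standardized test values and hypothetical predictions of the line
y_test_demo = np.array([0.2, -0.4, 3.5, 0.1, -0.3])
y_pred_demo = np.array([0.1, -0.2, 0.3, 0.0, -0.5])

# squared deviation of every test point
squared_errors = (y_test_demo - y_pred_demo) ** 2
mse_all = np.mean(squared_errors)

# index of the point that deviates most from the line
worst = np.argmax(squared_errors)

# average squared deviation without that point
mse_without = np.mean(np.delete(squared_errors, worst))

print(worst, mse_all, mse_without)
```

Whether a point may really be treated as an outlier is a decision that depends on the data and the research question, not only on the size of its deviation.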

<div class="alert alert-block alert-warning">
In the notebooks 'SeaLevelLinearRegression', 'SeaLevelRegression' and 'SeaLevelMLRegression', you will learn how to find a curve that best fits a given set of points, in addition to a straight line. You will also learn about underfitting and overfitting.</div>

<div>
<h2>Reference List</h2></div>

[1] Camargo, Miguel Angelo Branco, & Marenco, Ricardo Antonio. (2011). <br> &nbsp; &nbsp; &nbsp; &nbsp;Density, size and distribution of stomata in 35 rainforest tree species in Central Amazonia. Acta Amazonica, 41(2), 205-212. <br> &nbsp; &nbsp; &nbsp; &nbsp;https://dx.doi.org/10.1590/S0044-59672011000200004 and via email.<br>

<div>
    <h2>With support from</h2></div>

<img src="images/kikssteun.png" alt="Banner" width="1100"/>

<img src="images/cclic.png" alt="Banner" align="left" width="100"/><br><br>
Notebook KIKS, see <a href="http://www.aiopschool.be">AI At School</a>, by F. Wyffels & N. Gesquière is licensed under a <a href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>.